date()
## [1] "Wed Dec 7 15:31:17 2022"
IODS2022 is an introductory course to open data science and the open tools commonly used in it.
Link to my GitHub repository: https://github.com/jmleppal/IODS-project
I went through R for Health Data Science and got some grasp of the elementary use of R. All in all, the book is an excellent introduction to the topic. I'm trying to understand the inner workings of GitHub, but at the moment it looks to me like a long path of trial and error. When it comes to R Markdown, working in visual mode seems easier now in the beginning.
knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Dec 7 15:31:17 2022"
# READING IN THE ORIGINAL DATASET
# Read the data into memory
learning2014 <- read.table("http://www.helsinki.fi/~kvehkala/JYTmooc/JYTOPKYS3-data.txt", sep="\t", header=TRUE)
# Look at the dimensions and structure of the data
dim(learning2014)
## [1] 183 60
# str(learning2014)
The dataset used in this exercise is based on an international survey of students' approaches to learning in 2013-2015. Measures in parts A, B and C come from the ASSIST (Approaches and Study Skills Inventory for Students) questionnaire, and part D is based on the SATS (Survey of Attitudes Toward Statistics) questionnaire. Some background and other variables have been omitted, but all 183 responses are included. All in all, the dataset contains 60 variables and 183 observations. All variables except the one on gender are integers. No values are missing.
# CREATING THE ANALYSIS DATASET
# Access the dplyr library
library(dplyr)
# Create an analysis dataset
deep_questions <- c("D03", "D11", "D19", "D27", "D07", "D14", "D22", "D30","D06", "D15", "D23", "D31")
surface_questions <- c("SU02","SU10","SU18","SU26", "SU05","SU13","SU21","SU29","SU08","SU16","SU24","SU32")
strategic_questions <- c("ST01","ST09","ST17","ST25","ST04","ST12","ST20","ST28")
learn2014an <- learning2014 %>% select(gender, Age, Points)
learn2014an$deep <- rowMeans(learning2014[, deep_questions])
learn2014an$surf <- rowMeans(learning2014[, surface_questions])
learn2014an$stra <- rowMeans(learning2014[, strategic_questions])
# create the column 'attitude' by scaling the column "Attitude"
learn2014an$attitude <- learning2014$Attitude / 10
# print out the column names of the data
colnames(learn2014an)
## [1] "gender" "Age" "Points" "deep" "surf" "stra" "attitude"
# change the names of certain columns
colnames(learn2014an)[2] <- "age"
colnames(learn2014an)[3] <- "points"
# select rows where points is greater than zero
learn2014an <- filter(learn2014an, points > 0)
# Look at the dimensions and structure of the analysis data
# dim(learn2014an)
str(learn2014an)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
In the analysis dataset, the variables gender, age and points are unmodified; the variable attitude is formed by dividing the original one by 10; and the new variables deep, surf and stra are formed by averaging the respective part B variables. Attitude measures the global attitude towards statistics. Deep measures how inclined the participant is towards deep learning, surf how superficially the participant approaches studying, and stra how organized the participant's study habits are. Points are the points obtained in the exam of the course Introduction to Social Statistics.
A value of 0 in points means that the participant did not attend the exam (n = 17, 9%). These participants are excluded from the analysis dataset, which thus consists of 7 variables and 166 observations.
# SAVING THE ANALYSIS DATASET AND CHECKING THAT READING ANEW WORKS
# Set the working directory for saving and data
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")
# Access the tidyverse library
library(tidyverse)
# Save the analysis dataset
write.csv(learn2014an,"learn2014an.csv", row.names = FALSE)
# Checking that reading of the saved file works
# learn2014an2 <- read.table("learn2014an.csv", sep=",", header=TRUE)
# str(learn2014an2)
# head(learn2014an2)
Saving and reading the saved file work as expected.
# EXPLORING THE ANALYSIS DATA
# Access the GGally, ggplot2 and cowplot libraries
library(GGally)
library(ggplot2)
# Get summary values
learn2014an %>% group_by(gender) %>% summarise(count= n())
## # A tibble: 2 × 2
## gender count
## <chr> <int>
## 1 F 110
## 2 M 56
summarise_at(learn2014an, vars(attitude:points), tibble::lst(mean, sd))
## attitude_mean stra_mean surf_mean deep_mean points_mean attitude_sd stra_sd
## 1 3.142771 3.121235 2.787149 3.679719 22.71687 0.7299079 0.7718318
## surf_sd deep_sd points_sd
## 1 0.5288405 0.5541369 5.894884
learn2014an %>% group_by(gender) %>% summarise_at(vars(age), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <int> <dbl> <int>
## 1 F 24.9 7.36 17 22 53
## 2 M 26.8 8.43 19 24 55
learn2014an %>% group_by(gender) %>% summarise_at(vars(attitude), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 2.99 0.725 1.4 2.95 5
## 2 M 3.44 0.646 1.7 3.4 4.8
learn2014an %>% group_by(gender) %>% summarise_at(vars(deep), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 3.66 0.535 1.58 3.67 4.75
## 2 M 3.72 0.592 2.08 3.79 4.92
learn2014an %>% group_by(gender) %>% summarise_at(vars(surf), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 2.83 0.458 1.83 2.83 4
## 2 M 2.70 0.642 1.58 2.62 4.33
learn2014an %>% group_by(gender) %>% summarise_at(vars(stra), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 F 3.20 0.748 1.38 3.25 5
## 2 M 2.96 0.801 1.25 3 4.5
learn2014an %>% group_by(gender) %>% summarise_at(vars(points), tibble::lst(mean, sd, min, median, max))
## # A tibble: 2 × 6
## gender mean sd min median max
## <chr> <dbl> <dbl> <int> <dbl> <int>
## 1 F 22.3 5.83 7 23 33
## 2 M 23.5 6.01 9 23.5 33
# Create and draw a more advanced plot matrix with ggpairs()
p <- ggpairs(learn2014an[, c(1, 2, 7, 4, 5, 6, 3)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p
# Initialize plot with data and aesthetic mapping
p1 <- ggplot(learn2014an, aes(x = attitude, y = points))
# Define the visualization type (points) + add a regression line
# + add a main title and draw the plot
p2 <- p1 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's attitude versus exam points")
p2
Out of 166 participants, 56 (34%) are men and 110 (66%) women. Age ranges from 17 to 55 years, the average age (sd) being 26.8 (8.4) years for men and 24.9 (7.4) years for women. Mean scores (sd) are 3.1 (0.7) for attitude, 3.7 (0.6) for deep, 2.8 (0.5) for surf, 3.1 (0.8) for stra and 22.7 (5.9) for points. The distribution of age is skewed to the right; otherwise the distributions are reasonably symmetrical and do not differ notably between genders.
Attitude is significantly correlated with points (rho = 0.44), an association clearly visible in the respective plot. Surf is slightly negatively (rho = -0.14) and stra slightly positively (rho = 0.15) correlated with points, as expected. Surf also shows statistically significant negative correlations with attitude (rho = -0.18), deep (rho = -0.32) and stra (rho = -0.16), as expected. Otherwise the associations are unremarkable.
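The rho values quoted above can be reproduced with cor.test() using Spearman's method. A minimal sketch follows; simulated toy data stands in for the real learn2014an columns, so the rho and p-value here are illustrative only.

```r
# Spearman rank correlation with a significance test, as used for the rho
# values above; toy data stands in for learn2014an$attitude and $points.
set.seed(1)
toy <- data.frame(attitude = runif(166, 1, 5))
toy$points <- 11.6 + 3.5 * toy$attitude + rnorm(166, sd = 5.3)

ct <- cor.test(toy$points, toy$attitude, method = "spearman", exact = FALSE)
ct$estimate  # rho
ct$p.value   # significance of the correlation
```

On the real data, the same call with the actual columns gives the correlations reported in the ggpairs() matrix.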
# BUILDING A MULTIPLE REGRESSION MODEL
# Fit a linear model
my_model1 <- lm(points ~ age, data = learn2014an)
my_model2 <- lm(points ~ gender, data = learn2014an)
my_model3 <- lm(points ~ attitude, data = learn2014an)
my_model4 <- lm(points ~ deep, data = learn2014an)
my_model5 <- lm(points ~ surf, data = learn2014an)
my_model6 <- lm(points ~ stra, data = learn2014an)
my_model31 <- lm(points ~ attitude + stra + surf, data = learn2014an)
my_model32 <- lm(points ~ attitude + age + gender, data = learn2014an)
# Print out a summary of the models
summary(my_model1)
##
## Call:
## lm(formula = points ~ age, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.0360 -3.7531 0.0958 4.6762 10.8128
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 24.52150 1.57339 15.585 <2e-16 ***
## age -0.07074 0.05901 -1.199 0.232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008684, Adjusted R-squared: 0.00264
## F-statistic: 1.437 on 1 and 164 DF, p-value: 0.2324
summary(my_model2)
##
## Call:
## lm(formula = points ~ gender, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.3273 -3.3273 0.5179 4.5179 10.6727
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.3273 0.5613 39.776 <2e-16 ***
## genderM 1.1549 0.9664 1.195 0.234
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.887 on 164 degrees of freedom
## Multiple R-squared: 0.008632, Adjusted R-squared: 0.002587
## F-statistic: 1.428 on 1 and 164 DF, p-value: 0.2338
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
summary(my_model4)
##
## Call:
## lm(formula = points ~ deep, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.6913 -3.6935 0.2862 4.9957 10.3537
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 23.1141 3.0908 7.478 4.31e-12 ***
## deep -0.1080 0.8306 -0.130 0.897
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.913 on 164 degrees of freedom
## Multiple R-squared: 0.000103, Adjusted R-squared: -0.005994
## F-statistic: 0.01689 on 1 and 164 DF, p-value: 0.8967
summary(my_model5)
##
## Call:
## lm(formula = points ~ surf, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -14.6539 -3.3744 0.3574 4.4734 10.2234
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.2017 2.4432 11.134 <2e-16 ***
## surf -1.6091 0.8613 -1.868 0.0635 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.851 on 164 degrees of freedom
## Multiple R-squared: 0.02084, Adjusted R-squared: 0.01487
## F-statistic: 3.49 on 1 and 164 DF, p-value: 0.06351
summary(my_model6)
##
## Call:
## lm(formula = points ~ stra, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.5581 -3.8198 0.1042 4.3024 10.1394
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.233 1.897 10.141 <2e-16 ***
## stra 1.116 0.590 1.892 0.0603 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.849 on 164 degrees of freedom
## Multiple R-squared: 0.02135, Adjusted R-squared: 0.01538
## F-statistic: 3.578 on 1 and 164 DF, p-value: 0.06031
summary(my_model31)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
summary(my_model32)
##
## Call:
## lm(formula = points ~ attitude + age + gender, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## attitude 3.60657 0.59322 6.080 8.34e-09 ***
## age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
summary(my_model3)
##
## Call:
## lm(formula = points ~ attitude, data = learn2014an)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
# Draw diagnostic plots using the plot() function, choosing the plots 1, 2 and 5
par(mfrow = c(1,3))
plot(my_model3, which = c(1, 2, 5))
In univariate regression models, only attitude has a statistically significant association with points (p = 4.12e-09). Surf and stra are marginally associated with points (p = 0.06 for both).
Attitude, surf and stra are selected as explanatory variables for a multiple regression model. In the model, attitude preserves its statistical significance, while surf and stra add little. The overall fit of the model does not improve: there is practically no change in multiple R-squared (0.21 in the multiple regression versus 0.19 in the univariate model) and the overall p-value actually worsens (3.16e-08 versus 4.12e-09). Likewise, adding the demographic factors age and gender to the model does not improve it.
Therefore, only attitude is selected as the explanatory variable for the final model. An increase of one unit in attitude is associated with an increase of 3.5 points in the exam results. Based on the multiple R-squared statistic, attitude explains 19% of the variation in points. Even though this "explanatory power" is rather small, the model is statistically highly significant (F = 38.6 with 1 and 164 df, p = 4.12e-09). This kind of result is rather typical for human behavioral data.
In a linear regression model, the association between the dependent and explanatory variables is assumed to be linear, the residuals are assumed to be normally distributed with constant variance, and the explanatory variables should be uncorrelated with each other. The diagnostic plots did not show any noteworthy deviations; there were a couple of outliers that did not seem to distort the association between attitude and points. The distributions of both attitude and points were reasonably symmetrical and close enough to normal, and in the scatter plot of points by attitude the association appeared clearly monotonic and increasing.
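The visual diagnostics have simple numeric counterparts. The sketch below (toy data standing in for learn2014an; column names attitude and points assumed) checks residual symmetry, residual normality and leverage, which correspond roughly to diagnostic plots 1, 2 and 5.

```r
# Numeric counterparts of diagnostic plots 1, 2 and 5: residual summary,
# a formal normality test of the residuals, and the largest leverage values.
set.seed(2)
d <- data.frame(attitude = runif(166, 1, 5))
d$points <- 11.6 + 3.5 * d$attitude + rnorm(166, sd = 5.3)

fit <- lm(points ~ attitude, data = d)
summary(resid(fit))                               # roughly symmetric around 0?
shapiro.test(resid(fit))                          # normality of residuals
head(sort(hatvalues(fit), decreasing = TRUE), 3)  # highest-leverage observations
```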
knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Dec 7 15:31:35 2022"
# READING IN THE DATASET
# set the working directory
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")
# access the tidyverse library
library(tidyverse)
# read the data
alc <- read.table("alc.csv", sep = ",", header = TRUE)
# check structure and dimensions
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use"
# glimpse(alc)
The data is based on two surveys concerning student achievement in secondary education in two Portuguese schools. The dataset contains factors related to performances in mathematics and Portuguese language. The variables failures, paid, absences, G1, G2 and G3 present a combined assessment on both subjects. The analyses focus on the relationship between alcohol consumption (alc_use; 1 very low to 5 very high) and other variables. Alcohol use is considered high if it exceeds 2 (high_use).
All in all, there are 36 variables and 370 observations included. No values are missing.
As for primary hypotheses, I would expect going out, failures and absences to be positively and final grade (G3) to be negatively associated with alcohol use.
# EXPLORING THE DATA
# access the GGally and ggplot2 libraries
library(GGally); library(ggplot2)
# set summary values
summary(alc[, c(4, 8:9, 14:15, 23:25, 28:29, 31, 34:35)])
## age Medu Fedu traveltime studytime
## Min. :15.00 Min. :0.0 Min. :0.000 Min. :1.000 Min. :1.000
## 1st Qu.:16.00 1st Qu.:2.0 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:1.000
## Median :17.00 Median :3.0 Median :3.000 Median :1.000 Median :2.000
## Mean :16.58 Mean :2.8 Mean :2.557 Mean :1.446 Mean :2.043
## 3rd Qu.:17.00 3rd Qu.:4.0 3rd Qu.:3.750 3rd Qu.:2.000 3rd Qu.:2.000
## Max. :22.00 Max. :4.0 Max. :4.000 Max. :4.000 Max. :4.000
## famrel freetime goout health
## Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000
## 1st Qu.:4.000 1st Qu.:3.000 1st Qu.:2.000 1st Qu.:3.000
## Median :4.000 Median :3.000 Median :3.000 Median :4.000
## Mean :3.935 Mean :3.224 Mean :3.116 Mean :3.562
## 3rd Qu.:5.000 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:5.000
## Max. :5.000 Max. :5.000 Max. :5.000 Max. :5.000
## failures absences G3 alc_use
## Min. :0.0000 Min. : 0.000 Min. : 0.00 Min. :1.000
## 1st Qu.:0.0000 1st Qu.: 1.000 1st Qu.:10.00 1st Qu.:1.000
## Median :0.0000 Median : 3.000 Median :12.00 Median :1.500
## Mean :0.1892 Mean : 4.511 Mean :11.52 Mean :1.889
## 3rd Qu.:0.0000 3rd Qu.: 6.000 3rd Qu.:14.00 3rd Qu.:2.500
## Max. :3.0000 Max. :45.000 Max. :18.00 Max. :5.000
alc %>% group_by(sex) %>% summarise(count= n())
## # A tibble: 2 × 2
## sex count
## <chr> <int>
## 1 F 195
## 2 M 175
alc %>% group_by(high_use) %>% summarise(count= n())
## # A tibble: 2 × 2
## high_use count
## <lgl> <int>
## 1 FALSE 259
## 2 TRUE 111
alc %>% group_by(sex, high_use) %>% summarise(count= n())
## # A tibble: 4 × 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
p1 <- ggpairs(alc[, c(35, 4, 8:9, 14:15, 23:25)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p2 <- ggpairs(alc[, c(35, 28:29, 31, 34)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p3 <- ggpairs(alc[, c(36, 4, 8:9, 14:15, 23:25)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p4 <- ggpairs(alc[, c(36, 28:29, 31, 34:35)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p1; p2; p3; p4
# initialize plot with data and aesthetic mapping
p5 <- ggplot(alc, aes(x = goout, y = alc_use))
p6 <- ggplot(alc, aes(x = failures, y = alc_use))
p7 <- ggplot(alc, aes(x = absences, y = alc_use))
p8 <- ggplot(alc, aes(x = G3, y = alc_use))
# define the visualization type (points) + add a regression line
# + add a main title and draw the plot
p9 <- p5 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's going out versus alcohol use")
p10 <- p6 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's failures versus alcohol use")
p11 <- p7 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's absences versus alcohol use")
p12 <- p8 + geom_point() + geom_smooth(method = "lm") + ggtitle("Student's G3 versus alcohol use")
p9; p10; p11; p12
p13 <- ggpairs(alc[, c(3:4, 25, 29, 31, 34)], mapping = aes(), lower = list(combo = wrap("facethist", bins = 20)), progress = FALSE)
p13;
Out of 370 participants, 175 (47%) are men and 195 (53%) women. The median age is 17 years, ranging from 15 to 22 years. 111 (30%) participants use alcohol in high quantities; 70 (40%) of the men and 41 (21%) of the women are high users. The distributions (median, range) of going out (3, 1-5) and final grade (12, 0-18) are reasonably symmetrical, but failures (0, 0-3) and absences (3, 0-45) are highly skewed to the right.
As expected in the initial exploration, going out (rho = 0.40), failures (rho = 0.21) and absences (rho = 0.21) are positively, and final grade (rho = -0.16) negatively, associated with alcohol use. When alcohol use is dichotomized, these relationships seem to attenuate. On the other hand, going out (rho = -0.15), failures (rho = -0.36) and absences (rho = -0.10) are negatively associated with final grade. Furthermore, going out is positively associated with failures (rho = 0.14) and absences (rho = 0.11).
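The attenuation seen when alc_use is dichotomized is a general consequence of cutting a scale, not something specific to this dataset. A toy illustration (simulated data, not the alc dataset; the > 2 cut mirrors the definition of high_use):

```r
# Dichotomizing a 1-5 scale at > 2 discards information and typically
# weakens correlations; simulated illustration.
set.seed(3)
goout   <- sample(1:5, 370, replace = TRUE)
alc_use <- pmin(5, pmax(1, goout + sample(-2:2, 370, replace = TRUE)))
high_use <- alc_use > 2

cor(goout, alc_use, method = "spearman")               # full 1-5 scale
cor(goout, as.numeric(high_use), method = "spearman")  # dichotomized
```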
# BUILDING A LOGISTIC REGRESSION MODEL
# find the model with glm()
m1 <- glm(high_use ~ age, data = alc, family = "binomial")
m2 <- glm(high_use ~ sex, data = alc, family = "binomial")
m3 <- glm(high_use ~ goout, data = alc, family = "binomial")
m4 <- glm(high_use ~ failures, data = alc, family = "binomial")
m5 <- glm(high_use ~ absences, data = alc, family = "binomial")
m6 <- glm(high_use ~ G3, data = alc, family = "binomial")
m7 <- glm(high_use ~ age + sex, data = alc, family = "binomial")
m8 <- glm(high_use ~ goout + failures + absences + G3, data = alc, family = "binomial")
m9 <- glm(high_use ~ age + sex + goout + failures + absences, data = alc, family = "binomial")
m10 <- glm(high_use ~ sex + goout + failures + absences, data = alc, family = "binomial")
# print out a summary of the model
summary(m1); summary(m2); summary(m3); summary(m4); summary(m5);
##
## Call:
## glm(formula = high_use ~ age, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0968 -0.8705 -0.8020 1.4319 1.6943
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.07538 1.60980 -2.532 0.0114 *
## age 0.19414 0.09628 2.016 0.0438 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 447.94 on 368 degrees of freedom
## AIC: 451.94
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ sex, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0108 -1.0108 -0.6871 1.3537 1.7660
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3234 0.1757 -7.530 5.06e-14 ***
## sexM 0.9179 0.2339 3.925 8.67e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 436.13 on 368 degrees of freedom
## AIC: 440.13
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3712 -0.7687 -0.5470 0.9952 2.3037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3368 0.4188 -7.968 1.62e-15 ***
## goout 0.7563 0.1165 6.494 8.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 403.05 on 368 degrees of freedom
## AIC: 407.05
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ failures, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6367 -0.7943 -0.7943 1.3143 1.6171
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.9920 0.1234 -8.040 8.99e-16 ***
## failures 0.6759 0.1973 3.425 0.000615 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 439.75 on 368 degrees of freedom
## AIC: 443.75
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.3598 -0.8209 -0.7318 1.2658 1.7419
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.26945 0.16094 -7.888 3.08e-15 ***
## absences 0.08867 0.02317 3.827 0.00013 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 434.52 on 368 degrees of freedom
## AIC: 438.52
##
## Number of Fisher Scoring iterations: 4
summary(m6); summary(m7); summary(m8); summary(m9); summary(m10);
##
## Call:
## glm(formula = high_use ~ G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.2365 -0.8551 -0.7651 1.4213 1.7731
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.13783 0.40374 0.341 0.7328
## G3 -0.08689 0.03466 -2.507 0.0122 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 445.68 on 368 degrees of freedom
## AIC: 449.68
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ age + sex, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3034 -0.8870 -0.7086 1.2275 1.9163
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.72614 1.65095 -2.863 0.0042 **
## age 0.20425 0.09812 2.082 0.0374 *
## sexM 0.93259 0.23552 3.960 7.51e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 431.73 on 367 degrees of freedom
## AIC: 437.73
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ goout + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0077 -0.7367 -0.5427 0.9191 2.3664
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.42137 0.69467 -4.925 8.43e-07 ***
## goout 0.69791 0.11881 5.874 4.25e-09 ***
## failures 0.52078 0.23720 2.195 0.028128 *
## absences 0.07373 0.02233 3.303 0.000958 ***
## G3 -0.01612 0.04171 -0.386 0.699253
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 383.91 on 365 degrees of freedom
## AIC: 393.91
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ age + sex + goout + failures + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1461 -0.7853 -0.5274 0.7258 2.3609
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.72724 1.89765 -2.491 0.012735 *
## age 0.03623 0.11375 0.318 0.750115
## sexM 0.98157 0.26166 3.751 0.000176 ***
## goout 0.69397 0.12157 5.708 1.14e-08 ***
## failures 0.47932 0.23331 2.054 0.039930 *
## absences 0.08171 0.02276 3.590 0.000331 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.40 on 364 degrees of freedom
## AIC: 381.4
##
## Number of Fisher Scoring iterations: 4
##
## Call:
## glm(formula = high_use ~ sex + goout + failures + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.1521 -0.7929 -0.5317 0.7412 2.3706
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.14389 0.47881 -8.654 < 2e-16 ***
## sexM 0.97989 0.26156 3.746 0.000179 ***
## goout 0.69801 0.12093 5.772 7.83e-09 ***
## failures 0.48932 0.23073 2.121 0.033938 *
## absences 0.08246 0.02266 3.639 0.000274 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 369.50 on 365 degrees of freedom
## AIC: 379.5
##
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m1); coef(m2); coef(m3); coef(m4); coef(m5);
## (Intercept) age
## -4.0753783 0.1941355
## (Intercept) sexM
## -1.3233805 0.9179154
## (Intercept) goout
## -3.3368302 0.7563474
## (Intercept) failures
## -0.9920231 0.6758788
## (Intercept) absences
## -1.26944532 0.08866504
coef(m6); coef(m7); coef(m8); coef(m9); coef(m10);
## (Intercept) G3
## 0.1378294 -0.0868889
## (Intercept) age sexM
## -4.7261439 0.2042451 0.9325859
## (Intercept) goout failures absences G3
## -3.42137467 0.69790726 0.52077970 0.07373480 -0.01611567
## (Intercept) age sexM goout failures absences
## -4.72723868 0.03622756 0.98157242 0.69396783 0.47932468 0.08171370
## (Intercept) sexM goout failures absences
## -4.14389330 0.97988506 0.69801208 0.48932427 0.08245946
# comparing models
m82 <- glm(high_use ~ goout + failures + absences, data = alc, family = "binomial")
anova(m8, m82, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ goout + failures + absences + G3
## Model 2: high_use ~ goout + failures + absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 365 383.91
## 2 366 384.06 -1 -0.14891 0.6996
anova(m9, m10, test="LRT")
## Analysis of Deviance Table
##
## Model 1: high_use ~ age + sex + goout + failures + absences
## Model 2: high_use ~ sex + goout + failures + absences
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 364 369.4
## 2 365 369.5 -1 -0.10135 0.7502
# compute odds ratios (OR) and confidence intervals (CI) for the final model
OR10 <- coef(m10) %>% exp %>% round(digits = 2)
CI10 <- confint(m10) %>% exp %>% round(digits = 2)
# print out the odds ratios with their confidence intervals
cbind(OR10, CI10)
## OR10 2.5 % 97.5 %
## (Intercept) 0.02 0.01 0.04
## sexM 2.66 1.61 4.49
## goout 2.01 1.59 2.56
## failures 1.63 1.04 2.59
## absences 1.09 1.04 1.14
In univariate logistic regression models, all selected explanatory variables, as well as age and sex as demographic factors, are statistically significantly associated with high alcohol use. In the multiple regression model, however, final grade and age are no longer significantly associated with high alcohol consumption. Non-significant likelihood ratio tests also favor leaving them out of the final model. The point estimates of the explanatory variables included in the final model are nearly the same as in the respective univariate models.
In the final model, a one-unit increase in going out increases the odds of high alcohol use by 101% (95% CI 59%-156%), in failures by 63% (4%-159%) and in absences by 9% (4%-14%). Male sex increases the odds by 166% (95% CI 61%-349%).
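The percentage changes quoted above are just the exponentiated glm coefficients minus one, i.e. (OR - 1) * 100. A minimal sketch of the conversion on simulated data (not the fitted m10):

```r
# Odds ratios are exp(coefficients); percentage change in odds is (OR - 1) * 100.
set.seed(4)
x <- rnorm(370)
y <- rbinom(370, 1, plogis(-1 + 0.7 * x))
m <- glm(y ~ x, family = "binomial")

or  <- exp(coef(m))     # odds ratios
ci  <- exp(confint(m))  # profile-likelihood CIs on the OR scale
pct <- (or - 1) * 100   # e.g. OR 2.01 corresponds to +101% odds
cbind(or, ci)
```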
# PREDICTIVE ABILITY OF THE FINAL MODEL
# access dplyr and ggplot2
library(dplyr); library(ggplot2)
# fit the model
m10 <- glm(high_use ~ sex + goout + failures + absences, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m10, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% addmargins %>% round(digits = 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 243 16 259
## TRUE 61 50 111
## Sum 304 66 370
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins %>% round(digits = 2)
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66 0.04 0.70
## TRUE 0.16 0.14 0.30
## Sum 0.82 0.18 1.00
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = high_use, y = probability, col = prediction))
# define the geom as points and draw the plot
g + geom_point()
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5  # TRUE when the prediction is wrong
  mean(n_wrong)                       # proportion of wrong predictions
}
# call loss_func to compute the proportion of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = 0) %>% round(digits = 2)
## [1] 0.3
loss_func(class = alc$high_use, prob = 1) %>% round(digits = 2)
## [1] 0.7
loss_func(class = alc$high_use, prob = alc$probability) %>% round(digits = 2)
## [1] 0.21
The overall accuracy of the model is high: 293 of 370 participants (79%) are correctly classified. Accuracy is similar across the predicted classes: of the 304 predicted to be low users of alcohol, 243 (80%) truly are low users, and of the 66 predicted to be high users, 50 (76%) truly are high users.
The model clearly improves on naive guessing: 21% of participants are misclassified by the model, compared with 30% when everyone is assigned probability 0 (all predicted to be low users) and 70% when everyone is assigned probability 1 (all predicted to be high users).
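The classification figures above follow directly from the 2x2 cross-tabulation. A minimal sketch, with the counts hard-coded from the table above:

```r
# confusion-matrix counts from the cross-tabulation above
tn <- 243; fp <- 16; fn <- 61; tp <- 50
n  <- tn + fp + fn + tp               # 370 observations

accuracy <- (tn + tp) / n             # share correctly classified overall
npv      <- tn / (tn + fn)            # correct among those predicted low users
ppv      <- tp / (tp + fp)            # correct among those predicted high users
round(c(accuracy = accuracy, npv = npv, ppv = ppv), 2)  # 0.79, 0.80, 0.76
```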
# BONUS SECTION: ON CROSS-VALIDATION
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m10, K = nrow(alc) / 10)
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2108108
The final model has slightly better test-set performance: its cross-validation error of 0.21 compares favorably with the error of 0.26 produced by the model in the Exercise Set.
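Note that cv.glm() draws its folds at random, so the error estimate varies between runs; fixing the seed and using the conventional K = 10 makes it reproducible. A self-contained sketch on simulated data (standing in for alc), using the same loss function as above:

```r
library(boot)
set.seed(2022)                                  # fix the random fold assignment
n <- 370
d <- data.frame(x = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + d$x))           # simulated binary outcome
fit <- glm(y ~ x, data = d, family = "binomial")
# same mean-prediction-error loss as defined earlier
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
cv10 <- cv.glm(data = d, cost = loss_func, glmfit = fit, K = 10)
cv10$delta[1]                                   # 10-fold cross-validated error
```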
# SUPER-BONUS SECTION: CROSS-VALIDATION WITH DIFFERENT NUMBERS OF PREDICTORS
# I'm sure there is a much more elegant way of writing this R code...
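One more elegant alternative to dropping terms by hand is stats::step(), which performs backward elimination automatically using AIC, the same criterion tracked in the summaries below. A self-contained sketch on simulated data (standing in for the full model m21 on alc):

```r
# demo of automated backward elimination by AIC on simulated data
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
d$y <- rbinom(n, 1, plogis(-1 + 1.5 * d$x1))    # only x1 truly matters
full <- glm(y ~ x1 + x2 + noise, data = d, family = "binomial")
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)                                 # the informative x1 is retained
```

On the real data this would be `step(m21, direction = "backward")`, which should converge on a model close to the one reached manually below.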
# different models and their summaries
m21 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + schoolsup + famsup + paid + activities + nursery + higher + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m21)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Fedu + Mjob + guardian + school + reason + schoolsup +
## famsup + paid + activities + nursery + higher + internet +
## romantic + famrel + studytime + traveltime + freetime + goout +
## failures + absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7733 -0.6942 -0.3703 0.5275 2.8698
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.06934 2.95252 -1.040 0.298541
## age 0.05401 0.14774 0.366 0.714696
## sexM 0.86707 0.33018 2.626 0.008639 **
## health 0.19978 0.11237 1.778 0.075433 .
## addressU -0.85388 0.41398 -2.063 0.039149 *
## famsizeLE3 0.50174 0.32064 1.565 0.117621
## PstatusT -0.21412 0.51220 -0.418 0.675910
## Fedu 0.04477 0.15749 0.284 0.776218
## Mjobhealth -0.86921 0.66777 -1.302 0.193033
## Mjobother -0.76813 0.46827 -1.640 0.100929
## Mjobservices -0.65342 0.51502 -1.269 0.204540
## Mjobteacher -0.36452 0.57503 -0.634 0.526139
## guardianmother -0.81573 0.36120 -2.258 0.023921 *
## guardianother -0.02361 0.77809 -0.030 0.975790
## schoolMS -0.41859 0.55098 -0.760 0.447418
## reasonhome 0.43518 0.38491 1.131 0.258223
## reasonother 1.33386 0.52127 2.559 0.010501 *
## reasonreputation -0.05167 0.40177 -0.129 0.897677
## schoolsupyes 0.07500 0.45298 0.166 0.868493
## famsupyes -0.15239 0.32301 -0.472 0.637080
## paidyes 0.64657 0.31745 2.037 0.041673 *
## activitiesyes -0.54344 0.30862 -1.761 0.078264 .
## nurseryyes -0.46789 0.36303 -1.289 0.197453
## higheryes -0.17828 0.71758 -0.248 0.803787
## internetyes 0.21525 0.44485 0.484 0.628467
## romanticyes -0.53828 0.32531 -1.655 0.097993 .
## famrel -0.52687 0.16561 -3.181 0.001465 **
## studytime -0.28025 0.20271 -1.382 0.166824
## traveltime 0.25226 0.22899 1.102 0.270621
## freetime 0.23073 0.16392 1.408 0.159248
## goout 0.87900 0.15327 5.735 9.75e-09 ***
## failures 0.29553 0.27415 1.078 0.281042
## absences 0.09199 0.02614 3.519 0.000433 ***
## G3 0.02725 0.05191 0.525 0.599637
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 318.90 on 336 degrees of freedom
## AIC: 386.9
##
## Number of Fisher Scoring iterations: 5
m22 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + famsup + paid + activities + nursery + higher + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m22)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Fedu + Mjob + guardian + school + reason + famsup +
## paid + activities + nursery + higher + internet + romantic +
## famrel + studytime + traveltime + freetime + goout + failures +
## absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7609 -0.6940 -0.3720 0.5267 2.8647
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.96925 2.89116 -1.027 0.304415
## age 0.04918 0.14482 0.340 0.734150
## sexM 0.85938 0.32682 2.630 0.008551 **
## health 0.19904 0.11228 1.773 0.076282 .
## addressU -0.84988 0.41350 -2.055 0.039849 *
## famsizeLE3 0.49997 0.32044 1.560 0.118693
## PstatusT -0.21693 0.51219 -0.424 0.671902
## Fedu 0.04554 0.15741 0.289 0.772346
## Mjobhealth -0.87549 0.66660 -1.313 0.189056
## Mjobother -0.77016 0.46819 -1.645 0.099975 .
## Mjobservices -0.65192 0.51482 -1.266 0.205401
## Mjobteacher -0.36879 0.57448 -0.642 0.520900
## guardianmother -0.81692 0.36120 -2.262 0.023718 *
## guardianother -0.02329 0.77775 -0.030 0.976110
## schoolMS -0.42351 0.55004 -0.770 0.441325
## reasonhome 0.43646 0.38496 1.134 0.256889
## reasonother 1.33457 0.52098 2.562 0.010417 *
## reasonreputation -0.05037 0.40179 -0.125 0.900245
## famsupyes -0.15014 0.32279 -0.465 0.641833
## paidyes 0.64340 0.31691 2.030 0.042334 *
## activitiesyes -0.53930 0.30764 -1.753 0.079599 .
## nurseryyes -0.46621 0.36269 -1.285 0.198650
## higheryes -0.17380 0.71753 -0.242 0.808610
## internetyes 0.21588 0.44490 0.485 0.627516
## romanticyes -0.53981 0.32514 -1.660 0.096873 .
## famrel -0.52685 0.16560 -3.181 0.001465 **
## studytime -0.27968 0.20274 -1.380 0.167738
## traveltime 0.25567 0.22830 1.120 0.262772
## freetime 0.23063 0.16394 1.407 0.159502
## goout 0.87846 0.15325 5.732 9.9e-09 ***
## failures 0.29549 0.27438 1.077 0.281504
## absences 0.09192 0.02612 3.519 0.000433 ***
## G3 0.02598 0.05130 0.506 0.612585
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 318.93 on 337 degrees of freedom
## AIC: 384.93
##
## Number of Fisher Scoring iterations: 5
m23 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Fedu + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m23)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Fedu + Mjob + guardian + school + reason + famsup +
## paid + activities + nursery + internet + romantic + famrel +
## studytime + traveltime + freetime + goout + failures + absences +
## G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7885 -0.6919 -0.3695 0.5206 2.8581
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.19543 2.73368 -1.169 0.242440
## age 0.05587 0.14190 0.394 0.693793
## sexM 0.86745 0.32512 2.668 0.007629 **
## health 0.19870 0.11227 1.770 0.076750 .
## addressU -0.85349 0.41317 -2.066 0.038858 *
## famsizeLE3 0.49873 0.32026 1.557 0.119410
## PstatusT -0.21809 0.51176 -0.426 0.669994
## Fedu 0.04169 0.15666 0.266 0.790163
## Mjobhealth -0.88016 0.66551 -1.323 0.185988
## Mjobother -0.77519 0.46717 -1.659 0.097050 .
## Mjobservices -0.65521 0.51392 -1.275 0.202341
## Mjobteacher -0.37101 0.57428 -0.646 0.518248
## guardianmother -0.81986 0.36078 -2.272 0.023060 *
## guardianother -0.03054 0.77805 -0.039 0.968686
## schoolMS -0.43916 0.54672 -0.803 0.421826
## reasonhome 0.43014 0.38412 1.120 0.262790
## reasonother 1.33275 0.52008 2.563 0.010389 *
## reasonreputation -0.05181 0.40164 -0.129 0.897366
## famsupyes -0.15501 0.32206 -0.481 0.630290
## paidyes 0.63438 0.31470 2.016 0.043817 *
## activitiesyes -0.54909 0.30503 -1.800 0.071840 .
## nurseryyes -0.46197 0.36200 -1.276 0.201902
## internetyes 0.22716 0.44252 0.513 0.607727
## romanticyes -0.52965 0.32216 -1.644 0.100163
## famrel -0.52666 0.16560 -3.180 0.001471 **
## studytime -0.28176 0.20239 -1.392 0.163879
## traveltime 0.25472 0.22814 1.117 0.264205
## freetime 0.22989 0.16397 1.402 0.160909
## goout 0.87855 0.15327 5.732 9.93e-09 ***
## failures 0.30386 0.27304 1.113 0.265750
## absences 0.09209 0.02602 3.540 0.000401 ***
## G3 0.02346 0.05022 0.467 0.640372
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 318.98 on 338 degrees of freedom
## AIC: 382.98
##
## Number of Fisher Scoring iterations: 5
m24 <- glm(high_use ~ age + sex + health + address + famsize + Pstatus + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m24)
##
## Call:
## glm(formula = high_use ~ age + sex + health + address + famsize +
## Pstatus + Mjob + guardian + school + reason + famsup + paid +
## activities + nursery + internet + romantic + famrel + studytime +
## traveltime + freetime + goout + failures + absences + G3,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8024 -0.6867 -0.3726 0.5270 2.8586
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.05049 2.67834 -1.139 0.254725
## age 0.05400 0.14173 0.381 0.703172
## sexM 0.86914 0.32518 2.673 0.007522 **
## health 0.19961 0.11211 1.781 0.074992 .
## addressU -0.85786 0.41243 -2.080 0.037526 *
## famsizeLE3 0.48881 0.31803 1.537 0.124296
## PstatusT -0.23584 0.50787 -0.464 0.642377
## Mjobhealth -0.86508 0.66285 -1.305 0.191865
## Mjobother -0.78793 0.46436 -1.697 0.089737 .
## Mjobservices -0.65306 0.51381 -1.271 0.203720
## Mjobteacher -0.33772 0.56006 -0.603 0.546508
## guardianmother -0.84113 0.35206 -2.389 0.016887 *
## guardianother -0.03532 0.77788 -0.045 0.963782
## schoolMS -0.43365 0.54688 -0.793 0.427804
## reasonhome 0.43122 0.38402 1.123 0.261480
## reasonother 1.33397 0.52063 2.562 0.010400 *
## reasonreputation -0.04724 0.40116 -0.118 0.906266
## famsupyes -0.14646 0.32032 -0.457 0.647509
## paidyes 0.63330 0.31460 2.013 0.044111 *
## activitiesyes -0.54591 0.30486 -1.791 0.073341 .
## nurseryyes -0.44785 0.35820 -1.250 0.211199
## internetyes 0.23102 0.44189 0.523 0.601110
## romanticyes -0.52520 0.32146 -1.634 0.102303
## famrel -0.52726 0.16571 -3.182 0.001463 **
## studytime -0.28577 0.20165 -1.417 0.156434
## traveltime 0.24871 0.22696 1.096 0.273141
## freetime 0.22620 0.16335 1.385 0.166121
## goout 0.88397 0.15201 5.815 6.06e-09 ***
## failures 0.28968 0.26763 1.082 0.279066
## absences 0.09278 0.02592 3.579 0.000345 ***
## G3 0.02455 0.05001 0.491 0.623489
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.05 on 339 degrees of freedom
## AIC: 381.05
##
## Number of Fisher Scoring iterations: 5
m25 <- glm(high_use ~ sex + health + address + famsize + Pstatus + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m25)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Pstatus +
## Mjob + guardian + school + reason + famsup + paid + activities +
## nursery + internet + romantic + famrel + studytime + traveltime +
## freetime + goout + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8115 -0.6780 -0.3741 0.5351 2.8662
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.18345 1.40586 -1.553 0.120398
## sexM 0.87542 0.32423 2.700 0.006935 **
## health 0.19812 0.11194 1.770 0.076744 .
## addressU -0.87032 0.41091 -2.118 0.034173 *
## famsizeLE3 0.48549 0.31783 1.528 0.126636
## PstatusT -0.23136 0.50751 -0.456 0.648477
## Mjobhealth -0.86599 0.66254 -1.307 0.191187
## Mjobother -0.78347 0.46470 -1.686 0.091799 .
## Mjobservices -0.65478 0.51433 -1.273 0.202993
## Mjobteacher -0.33410 0.55988 -0.597 0.550689
## guardianmother -0.83727 0.35200 -2.379 0.017378 *
## guardianother 0.03273 0.75903 0.043 0.965609
## schoolMS -0.36884 0.52019 -0.709 0.478288
## reasonhome 0.42938 0.38416 1.118 0.263691
## reasonother 1.31806 0.51857 2.542 0.011030 *
## reasonreputation -0.04819 0.40091 -0.120 0.904328
## famsupyes -0.15410 0.31946 -0.482 0.629540
## paidyes 0.63749 0.31432 2.028 0.042544 *
## activitiesyes -0.54843 0.30457 -1.801 0.071751 .
## nurseryyes -0.43975 0.35738 -1.231 0.218509
## internetyes 0.22314 0.44134 0.506 0.613142
## romanticyes -0.50541 0.31711 -1.594 0.110986
## famrel -0.52404 0.16539 -3.169 0.001532 **
## studytime -0.28015 0.20040 -1.398 0.162138
## traveltime 0.24680 0.22724 1.086 0.277444
## freetime 0.22240 0.16299 1.365 0.172397
## goout 0.88971 0.15127 5.881 4.07e-09 ***
## failures 0.30230 0.26479 1.142 0.253582
## absences 0.09363 0.02576 3.635 0.000278 ***
## G3 0.02386 0.05003 0.477 0.633404
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.20 on 340 degrees of freedom
## AIC: 379.2
##
## Number of Fisher Scoring iterations: 5
m26 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + famsup + paid + activities + nursery + internet + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m26)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + famsup + paid + activities +
## nursery + internet + romantic + famrel + studytime + traveltime +
## freetime + goout + failures + absences + G3, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8257 -0.6871 -0.3759 0.5287 2.8577
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.39445 1.32839 -1.803 0.071465 .
## sexM 0.87214 0.32419 2.690 0.007141 **
## health 0.19659 0.11203 1.755 0.079284 .
## addressU -0.87215 0.41028 -2.126 0.033526 *
## famsizeLE3 0.49404 0.31668 1.560 0.118747
## Mjobhealth -0.85441 0.66232 -1.290 0.197039
## Mjobother -0.75819 0.46117 -1.644 0.100168
## Mjobservices -0.63041 0.51137 -1.233 0.217651
## Mjobteacher -0.31297 0.55700 -0.562 0.574195
## guardianmother -0.82015 0.34963 -2.346 0.018987 *
## guardianother 0.07088 0.76004 0.093 0.925699
## schoolMS -0.37703 0.51952 -0.726 0.468011
## reasonhome 0.42329 0.38397 1.102 0.270289
## reasonother 1.31565 0.51775 2.541 0.011051 *
## reasonreputation -0.04565 0.40123 -0.114 0.909410
## famsupyes -0.15297 0.31920 -0.479 0.631787
## paidyes 0.62587 0.31306 1.999 0.045585 *
## activitiesyes -0.55798 0.30363 -1.838 0.066104 .
## nurseryyes -0.42608 0.35608 -1.197 0.231472
## internetyes 0.20904 0.44029 0.475 0.634943
## romanticyes -0.50246 0.31704 -1.585 0.113007
## famrel -0.52594 0.16579 -3.172 0.001512 **
## studytime -0.28026 0.20035 -1.399 0.161869
## traveltime 0.24293 0.22631 1.073 0.283073
## freetime 0.21494 0.16181 1.328 0.184067
## goout 0.88814 0.15097 5.883 4.04e-09 ***
## failures 0.30053 0.26510 1.134 0.256940
## absences 0.09458 0.02555 3.702 0.000214 ***
## G3 0.02611 0.04981 0.524 0.600062
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.40 on 341 degrees of freedom
## AIC: 377.4
##
## Number of Fisher Scoring iterations: 5
m27 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + famsup + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m27)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + famsup + paid + activities +
## nursery + romantic + famrel + studytime + traveltime + freetime +
## goout + failures + absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8328 -0.6796 -0.3747 0.5404 2.8683
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.32240 1.31885 -1.761 0.078251 .
## sexM 0.87340 0.32425 2.694 0.007068 **
## health 0.19373 0.11162 1.736 0.082649 .
## addressU -0.83448 0.40234 -2.074 0.038073 *
## famsizeLE3 0.49752 0.31629 1.573 0.115724
## Mjobhealth -0.79381 0.65218 -1.217 0.223542
## Mjobother -0.71259 0.45084 -1.581 0.113970
## Mjobservices -0.57330 0.49752 -1.152 0.249191
## Mjobteacher -0.25368 0.54375 -0.467 0.640826
## guardianmother -0.82837 0.34921 -2.372 0.017686 *
## guardianother 0.05555 0.75518 0.074 0.941360
## schoolMS -0.37018 0.51814 -0.714 0.474950
## reasonhome 0.41920 0.38397 1.092 0.274942
## reasonother 1.31310 0.51752 2.537 0.011172 *
## reasonreputation -0.04511 0.40044 -0.113 0.910317
## famsupyes -0.14499 0.31820 -0.456 0.648633
## paidyes 0.63651 0.31213 2.039 0.041427 *
## activitiesyes -0.54865 0.30311 -1.810 0.070282 .
## nurseryyes -0.44357 0.35358 -1.255 0.209647
## romanticyes -0.49066 0.31597 -1.553 0.120454
## famrel -0.52478 0.16579 -3.165 0.001549 **
## studytime -0.27793 0.20027 -1.388 0.165190
## traveltime 0.24304 0.22637 1.074 0.282979
## freetime 0.21735 0.16147 1.346 0.178275
## goout 0.89168 0.15095 5.907 3.48e-09 ***
## failures 0.29404 0.26526 1.109 0.267637
## absences 0.09631 0.02527 3.811 0.000139 ***
## G3 0.02666 0.04981 0.535 0.592451
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.63 on 342 degrees of freedom
## AIC: 375.63
##
## Number of Fisher Scoring iterations: 5
m28 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences + G3, data = alc, family = "binomial")
summary(m28)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + paid + activities + nursery +
## romantic + famrel + studytime + traveltime + freetime + goout +
## failures + absences + G3, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8383 -0.6862 -0.3765 0.5306 2.8548
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.38141 1.31082 -1.817 0.069258 .
## sexM 0.89197 0.32200 2.770 0.005605 **
## health 0.18934 0.11099 1.706 0.088019 .
## addressU -0.82360 0.40107 -2.054 0.040024 *
## famsizeLE3 0.50629 0.31542 1.605 0.108462
## Mjobhealth -0.83338 0.64562 -1.291 0.196763
## Mjobother -0.70548 0.45022 -1.567 0.117122
## Mjobservices -0.58517 0.49658 -1.178 0.238640
## Mjobteacher -0.27326 0.54111 -0.505 0.613565
## guardianmother -0.82408 0.34927 -2.359 0.018302 *
## guardianother 0.03067 0.75544 0.041 0.967611
## schoolMS -0.33536 0.51293 -0.654 0.513234
## reasonhome 0.41271 0.38353 1.076 0.281888
## reasonother 1.32527 0.51650 2.566 0.010292 *
## reasonreputation -0.04696 0.39985 -0.117 0.906507
## paidyes 0.60780 0.30520 1.991 0.046429 *
## activitiesyes -0.54458 0.30288 -1.798 0.072176 .
## nurseryyes -0.44167 0.35364 -1.249 0.211699
## romanticyes -0.49720 0.31594 -1.574 0.115551
## famrel -0.52100 0.16548 -3.148 0.001642 **
## studytime -0.28997 0.19876 -1.459 0.144587
## traveltime 0.24073 0.22604 1.065 0.286887
## freetime 0.21119 0.16069 1.314 0.188762
## goout 0.89196 0.15073 5.918 3.27e-09 ***
## failures 0.29978 0.26516 1.131 0.258244
## absences 0.09569 0.02512 3.810 0.000139 ***
## G3 0.02776 0.04978 0.558 0.577025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 319.84 on 343 degrees of freedom
## AIC: 373.84
##
## Number of Fisher Scoring iterations: 5
m29 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + school + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m29)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + school + reason + paid + activities + nursery +
## romantic + famrel + studytime + traveltime + freetime + goout +
## failures + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8834 -0.6840 -0.3776 0.5414 2.8104
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.05408 1.16884 -1.757 0.078855 .
## sexM 0.88825 0.32061 2.771 0.005597 **
## health 0.18175 0.11005 1.652 0.098633 .
## addressU -0.80345 0.39831 -2.017 0.043681 *
## famsizeLE3 0.52116 0.31421 1.659 0.097196 .
## Mjobhealth -0.78493 0.63846 -1.229 0.218919
## Mjobother -0.69892 0.44936 -1.555 0.119860
## Mjobservices -0.54356 0.48975 -1.110 0.267057
## Mjobteacher -0.25304 0.53837 -0.470 0.638348
## guardianmother -0.82071 0.34918 -2.350 0.018755 *
## guardianother 0.01112 0.74944 0.015 0.988160
## schoolMS -0.35999 0.51109 -0.704 0.481214
## reasonhome 0.41977 0.38309 1.096 0.273185
## reasonother 1.32112 0.51711 2.555 0.010624 *
## reasonreputation -0.04785 0.39908 -0.120 0.904569
## paidyes 0.61105 0.30511 2.003 0.045206 *
## activitiesyes -0.53253 0.30174 -1.765 0.077585 .
## nurseryyes -0.44481 0.35308 -1.260 0.207738
## romanticyes -0.50890 0.31513 -1.615 0.106340
## famrel -0.51743 0.16515 -3.133 0.001730 **
## studytime -0.27759 0.19682 -1.410 0.158435
## traveltime 0.23358 0.22496 1.038 0.299133
## freetime 0.21474 0.16047 1.338 0.180840
## goout 0.87700 0.14758 5.943 2.81e-09 ***
## failures 0.25908 0.25368 1.021 0.307102
## absences 0.09469 0.02499 3.790 0.000151 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 320.15 on 344 degrees of freedom
## AIC: 372.15
##
## Number of Fisher Scoring iterations: 5
m30 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + traveltime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m30)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + nursery + romantic +
## famrel + studytime + traveltime + freetime + goout + failures +
## absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8839 -0.6898 -0.3775 0.5367 2.8255
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.17730 1.15541 -1.884 0.059505 .
## sexM 0.89525 0.31999 2.798 0.005146 **
## health 0.19003 0.10953 1.735 0.082761 .
## addressU -0.72544 0.38358 -1.891 0.058593 .
## famsizeLE3 0.51249 0.31324 1.636 0.101822
## Mjobhealth -0.77546 0.63678 -1.218 0.223304
## Mjobother -0.70399 0.44936 -1.567 0.117195
## Mjobservices -0.54322 0.48904 -1.111 0.266656
## Mjobteacher -0.27495 0.53673 -0.512 0.608467
## guardianmother -0.79620 0.34737 -2.292 0.021900 *
## guardianother 0.02925 0.75311 0.039 0.969021
## reasonhome 0.40747 0.38255 1.065 0.286817
## reasonother 1.26791 0.51068 2.483 0.013035 *
## reasonreputation -0.02835 0.39735 -0.071 0.943126
## paidyes 0.60009 0.30427 1.972 0.048585 *
## activitiesyes -0.51400 0.30017 -1.712 0.086831 .
## nurseryyes -0.43167 0.35401 -1.219 0.222709
## romanticyes -0.52006 0.31407 -1.656 0.097747 .
## famrel -0.51113 0.16409 -3.115 0.001840 **
## studytime -0.26663 0.19567 -1.363 0.173005
## traveltime 0.20841 0.22109 0.943 0.345871
## freetime 0.20529 0.15996 1.283 0.199357
## goout 0.87369 0.14741 5.927 3.08e-09 ***
## failures 0.27010 0.25267 1.069 0.285069
## absences 0.09587 0.02486 3.857 0.000115 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 320.66 on 345 degrees of freedom
## AIC: 370.66
##
## Number of Fisher Scoring iterations: 5
m31 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + freetime + goout + failures + absences, data = alc, family = "binomial")
summary(m31)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + nursery + romantic +
## famrel + studytime + freetime + goout + failures + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9201 -0.6693 -0.3788 0.6131 2.8134
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.73261 1.04989 -1.650 0.098888 .
## sexM 0.90570 0.32041 2.827 0.004704 **
## health 0.19158 0.10943 1.751 0.080000 .
## addressU -0.85025 0.35852 -2.372 0.017714 *
## famsizeLE3 0.54170 0.31175 1.738 0.082274 .
## Mjobhealth -0.82277 0.63458 -1.297 0.194785
## Mjobother -0.71149 0.44970 -1.582 0.113618
## Mjobservices -0.54651 0.48736 -1.121 0.262138
## Mjobteacher -0.30889 0.53811 -0.574 0.565951
## guardianmother -0.82702 0.34479 -2.399 0.016456 *
## guardianother 0.09127 0.74590 0.122 0.902610
## reasonhome 0.37498 0.38059 0.985 0.324487
## reasonother 1.21870 0.50695 2.404 0.016218 *
## reasonreputation -0.05788 0.39576 -0.146 0.883733
## paidyes 0.61034 0.30451 2.004 0.045036 *
## activitiesyes -0.51451 0.30001 -1.715 0.086355 .
## nurseryyes -0.43764 0.35353 -1.238 0.215742
## romanticyes -0.51168 0.31366 -1.631 0.102818
## famrel -0.50391 0.16334 -3.085 0.002035 **
## studytime -0.28187 0.19527 -1.444 0.148876
## freetime 0.19340 0.15912 1.215 0.224208
## goout 0.88148 0.14680 6.004 1.92e-09 ***
## failures 0.28102 0.25169 1.117 0.264191
## absences 0.09545 0.02474 3.859 0.000114 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 321.54 on 346 degrees of freedom
## AIC: 369.54
##
## Number of Fisher Scoring iterations: 5
m32 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + nursery + romantic + famrel + studytime + freetime + goout + absences, data = alc, family = "binomial")
summary(m32)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + nursery + romantic +
## famrel + studytime + freetime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8364 -0.6916 -0.3742 0.6186 2.8205
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.62838 1.03909 -1.567 0.11709
## sexM 0.92409 0.31965 2.891 0.00384 **
## health 0.20762 0.10799 1.923 0.05453 .
## addressU -0.86632 0.35739 -2.424 0.01535 *
## famsizeLE3 0.54741 0.31159 1.757 0.07895 .
## Mjobhealth -0.91185 0.63436 -1.437 0.15060
## Mjobother -0.75304 0.44670 -1.686 0.09184 .
## Mjobservices -0.56842 0.48561 -1.171 0.24179
## Mjobteacher -0.41462 0.53144 -0.780 0.43528
## guardianmother -0.81293 0.34396 -2.363 0.01811 *
## guardianother 0.21191 0.73660 0.288 0.77359
## reasonhome 0.37154 0.37997 0.978 0.32817
## reasonother 1.21736 0.50496 2.411 0.01592 *
## reasonreputation -0.08370 0.39379 -0.213 0.83167
## paidyes 0.57695 0.30272 1.906 0.05666 .
## activitiesyes -0.52457 0.29905 -1.754 0.07941 .
## nurseryyes -0.42051 0.35095 -1.198 0.23084
## romanticyes -0.50262 0.31226 -1.610 0.10748
## famrel -0.52456 0.16165 -3.245 0.00117 **
## studytime -0.31888 0.19250 -1.657 0.09762 .
## freetime 0.19459 0.15863 1.227 0.21994
## goout 0.90675 0.14581 6.219 5.01e-10 ***
## absences 0.09679 0.02472 3.916 8.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 322.81 on 347 degrees of freedom
## AIC: 368.81
##
## Number of Fisher Scoring iterations: 5
m33 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + romantic + famrel + studytime + freetime + goout + absences, data = alc, family = "binomial")
summary(m33)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + romantic + famrel +
## studytime + freetime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8594 -0.7167 -0.3836 0.6146 2.7734
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.81020 1.02798 -1.761 0.07825 .
## sexM 0.90662 0.31905 2.842 0.00449 **
## health 0.20825 0.10722 1.942 0.05210 .
## addressU -0.88950 0.35733 -2.489 0.01280 *
## famsizeLE3 0.49667 0.30835 1.611 0.10723
## Mjobhealth -0.99444 0.63260 -1.572 0.11595
## Mjobother -0.73015 0.44639 -1.636 0.10191
## Mjobservices -0.56579 0.48261 -1.172 0.24105
## Mjobteacher -0.45608 0.52878 -0.863 0.38840
## guardianmother -0.83220 0.34354 -2.422 0.01542 *
## guardianother 0.30055 0.73749 0.408 0.68362
## reasonhome 0.36564 0.37990 0.962 0.33581
## reasonother 1.20614 0.50090 2.408 0.01604 *
## reasonreputation -0.08088 0.39255 -0.206 0.83676
## paidyes 0.56829 0.30209 1.881 0.05995 .
## activitiesyes -0.50589 0.29824 -1.696 0.08984 .
## romanticyes -0.49832 0.31073 -1.604 0.10879
## famrel -0.52072 0.16068 -3.241 0.00119 **
## studytime -0.34536 0.19110 -1.807 0.07073 .
## freetime 0.18664 0.15763 1.184 0.23639
## goout 0.89963 0.14511 6.200 5.65e-10 ***
## absences 0.09505 0.02457 3.868 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 324.23 on 348 degrees of freedom
## AIC: 368.23
##
## Number of Fisher Scoring iterations: 5
m34 <- glm(high_use ~ sex + health + address + famsize + Mjob + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m34)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + Mjob +
## guardian + reason + paid + activities + romantic + famrel +
## studytime + goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8798 -0.7229 -0.3861 0.6079 2.7816
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.42394 0.96719 -1.472 0.14096
## sexM 0.94987 0.31747 2.992 0.00277 **
## health 0.20524 0.10683 1.921 0.05472 .
## addressU -0.86236 0.35600 -2.422 0.01542 *
## famsizeLE3 0.48070 0.30752 1.563 0.11802
## Mjobhealth -0.98267 0.62987 -1.560 0.11873
## Mjobother -0.69956 0.44378 -1.576 0.11494
## Mjobservices -0.52858 0.47796 -1.106 0.26877
## Mjobteacher -0.41155 0.52733 -0.780 0.43513
## guardianmother -0.84478 0.34325 -2.461 0.01385 *
## guardianother 0.37225 0.73463 0.507 0.61235
## reasonhome 0.30688 0.37404 0.820 0.41196
## reasonother 1.17332 0.49988 2.347 0.01891 *
## reasonreputation -0.11529 0.39057 -0.295 0.76786
## paidyes 0.56116 0.30094 1.865 0.06222 .
## activitiesyes -0.47737 0.29595 -1.613 0.10674
## romanticyes -0.46686 0.30805 -1.516 0.12964
## famrel -0.50323 0.15945 -3.156 0.00160 **
## studytime -0.35054 0.19051 -1.840 0.06576 .
## goout 0.94273 0.14165 6.655 2.83e-11 ***
## absences 0.09338 0.02452 3.807 0.00014 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 325.64 on 349 degrees of freedom
## AIC: 367.64
##
## Number of Fisher Scoring iterations: 5
m35 <- glm(high_use ~ sex + health + address + famsize + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m35)
##
## Call:
## glm(formula = high_use ~ sex + health + address + famsize + guardian +
## reason + paid + activities + romantic + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8674 -0.7171 -0.4029 0.5821 2.7665
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.72681 0.94633 -1.825 0.068041 .
## sexM 0.91034 0.30568 2.978 0.002901 **
## health 0.17681 0.10399 1.700 0.089096 .
## addressU -0.94771 0.34320 -2.761 0.005756 **
## famsizeLE3 0.45183 0.30426 1.485 0.137536
## guardianmother -0.75088 0.33290 -2.256 0.024096 *
## guardianother 0.40855 0.73349 0.557 0.577533
## reasonhome 0.20790 0.36585 0.568 0.569858
## reasonother 1.02210 0.48846 2.093 0.036393 *
## reasonreputation -0.27714 0.37752 -0.734 0.462877
## paidyes 0.54838 0.29252 1.875 0.060834 .
## activitiesyes -0.46361 0.28927 -1.603 0.109002
## romanticyes -0.45457 0.30736 -1.479 0.139160
## famrel -0.48800 0.15813 -3.086 0.002028 **
## studytime -0.32528 0.18830 -1.727 0.084081 .
## goout 0.90858 0.13835 6.567 5.13e-11 ***
## absences 0.09174 0.02432 3.773 0.000161 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 329.01 on 353 degrees of freedom
## AIC: 363.01
##
## Number of Fisher Scoring iterations: 5
m36 <- glm(high_use ~ sex + health + address + guardian + reason + paid + activities + romantic + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m36)
##
## Call:
## glm(formula = high_use ~ sex + health + address + guardian +
## reason + paid + activities + romantic + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9142 -0.7135 -0.4238 0.6314 2.7314
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.57319 0.93900 -1.675 0.093858 .
## sexM 0.94921 0.30255 3.137 0.001705 **
## health 0.16840 0.10362 1.625 0.104131
## addressU -0.88181 0.33899 -2.601 0.009288 **
## guardianmother -0.69988 0.32862 -2.130 0.033194 *
## guardianother 0.40360 0.72237 0.559 0.576358
## reasonhome 0.17061 0.36325 0.470 0.638586
## reasonother 0.97289 0.48759 1.995 0.046011 *
## reasonreputation -0.28239 0.37686 -0.749 0.453655
## paidyes 0.54303 0.29151 1.863 0.062486 .
## activitiesyes -0.47048 0.28867 -1.630 0.103146
## romanticyes -0.41451 0.30400 -1.363 0.172726
## famrel -0.49000 0.15648 -3.131 0.001740 **
## studytime -0.33980 0.18695 -1.818 0.069132 .
## goout 0.89470 0.13681 6.540 6.17e-11 ***
## absences 0.09182 0.02445 3.755 0.000174 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 331.20 on 354 degrees of freedom
## AIC: 363.2
##
## Number of Fisher Scoring iterations: 5
m37 <- glm(high_use ~ sex + health + address + guardian + reason + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m37)
##
## Call:
## glm(formula = high_use ~ sex + health + address + guardian +
## reason + paid + activities + famrel + studytime + goout +
## absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8365 -0.7167 -0.4191 0.6244 2.7675
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.64430 0.93738 -1.754 0.079404 .
## sexM 0.95963 0.30116 3.186 0.001440 **
## health 0.15499 0.10292 1.506 0.132074
## addressU -0.87733 0.33754 -2.599 0.009346 **
## guardianmother -0.69452 0.32728 -2.122 0.033826 *
## guardianother 0.34130 0.71444 0.478 0.632848
## reasonhome 0.17125 0.36156 0.474 0.635751
## reasonother 0.88750 0.48109 1.845 0.065069 .
## reasonreputation -0.26597 0.37622 -0.707 0.479601
## paidyes 0.55647 0.29067 1.914 0.055563 .
## activitiesyes -0.49094 0.28719 -1.709 0.087364 .
## famrel -0.47145 0.15487 -3.044 0.002334 **
## studytime -0.35932 0.18770 -1.914 0.055578 .
## goout 0.88596 0.13606 6.511 7.45e-11 ***
## absences 0.08737 0.02401 3.639 0.000274 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 333.10 on 355 degrees of freedom
## AIC: 363.1
##
## Number of Fisher Scoring iterations: 5
m38 <- glm(high_use ~ sex + address + guardian + reason + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m38)
##
## Call:
## glm(formula = high_use ~ sex + address + guardian + reason +
## paid + activities + famrel + studytime + goout + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9219 -0.7435 -0.4192 0.6018 2.8389
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.12869 0.86885 -1.299 0.193927
## sexM 1.00904 0.29874 3.378 0.000731 ***
## addressU -0.89571 0.33781 -2.652 0.008013 **
## guardianmother -0.70751 0.32584 -2.171 0.029904 *
## guardianother 0.29125 0.71045 0.410 0.681839
## reasonhome 0.14091 0.35838 0.393 0.694178
## reasonother 0.86090 0.47814 1.800 0.071783 .
## reasonreputation -0.33539 0.37241 -0.901 0.367793
## paidyes 0.52319 0.28763 1.819 0.068918 .
## activitiesyes -0.50255 0.28643 -1.755 0.079335 .
## famrel -0.43812 0.15217 -2.879 0.003987 **
## studytime -0.35475 0.18676 -1.899 0.057500 .
## goout 0.87536 0.13595 6.439 1.21e-10 ***
## absences 0.08705 0.02377 3.662 0.000251 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 335.41 on 356 degrees of freedom
## AIC: 363.41
##
## Number of Fisher Scoring iterations: 5
m39 <- glm(high_use ~ sex + address + guardian + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m39)
##
## Call:
## glm(formula = high_use ~ sex + address + guardian + paid + activities +
## famrel + studytime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8631 -0.7338 -0.4343 0.6525 2.7038
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.98780 0.84464 -1.169 0.242204
## sexM 1.01751 0.29384 3.463 0.000535 ***
## addressU -0.88774 0.32823 -2.705 0.006838 **
## guardianmother -0.68091 0.32127 -2.119 0.034056 *
## guardianother 0.27875 0.68783 0.405 0.685287
## paidyes 0.58954 0.27992 2.106 0.035195 *
## activitiesyes -0.52740 0.27932 -1.888 0.059003 .
## famrel -0.41406 0.14977 -2.765 0.005698 **
## studytime -0.42783 0.18304 -2.337 0.019422 *
## goout 0.84987 0.13236 6.421 1.35e-10 ***
## absences 0.08434 0.02276 3.705 0.000211 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 341.12 on 359 degrees of freedom
## AIC: 363.12
##
## Number of Fisher Scoring iterations: 5
m40 <- glm(high_use ~ sex + address + paid + activities + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m40)
##
## Call:
## glm(formula = high_use ~ sex + address + paid + activities +
## famrel + studytime + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8947 -0.7350 -0.4209 0.6334 2.8292
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.50769 0.81175 -1.857 0.063263 .
## sexM 1.03295 0.29075 3.553 0.000381 ***
## addressU -0.80243 0.32151 -2.496 0.012566 *
## paidyes 0.55253 0.27681 1.996 0.045928 *
## activitiesyes -0.50399 0.27583 -1.827 0.067672 .
## famrel -0.39748 0.14678 -2.708 0.006769 **
## studytime -0.41410 0.18179 -2.278 0.022734 *
## goout 0.82299 0.12977 6.342 2.27e-10 ***
## absences 0.07971 0.02254 3.536 0.000406 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 347.06 on 361 degrees of freedom
## AIC: 365.06
##
## Number of Fisher Scoring iterations: 5
m41 <- glm(high_use ~ sex + address + paid + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m41)
##
## Call:
## glm(formula = high_use ~ sex + address + paid + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9387 -0.7500 -0.4471 0.7226 2.7051
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.55596 0.80528 -1.932 0.053334 .
## sexM 0.94192 0.28379 3.319 0.000903 ***
## addressU -0.74838 0.31779 -2.355 0.018526 *
## paidyes 0.54137 0.27503 1.968 0.049027 *
## famrel -0.39952 0.14509 -2.754 0.005896 **
## studytime -0.46626 0.18048 -2.583 0.009782 **
## goout 0.80085 0.12757 6.278 3.43e-10 ***
## absences 0.07814 0.02253 3.468 0.000524 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 350.45 on 362 degrees of freedom
## AIC: 366.45
##
## Number of Fisher Scoring iterations: 5
m42 <- glm(high_use ~ sex + address + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m42)
##
## Call:
## glm(formula = high_use ~ sex + address + famrel + studytime +
## goout + absences, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9143 -0.7377 -0.4386 0.7459 2.5829
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.32310 0.79014 -1.675 0.094030 .
## sexM 0.88170 0.27864 3.164 0.001555 **
## addressU -0.68822 0.31409 -2.191 0.028439 *
## famrel -0.40542 0.14396 -2.816 0.004860 **
## studytime -0.42030 0.17536 -2.397 0.016538 *
## goout 0.78974 0.12648 6.244 4.27e-10 ***
## absences 0.07432 0.02230 3.332 0.000861 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 354.40 on 363 degrees of freedom
## AIC: 368.4
##
## Number of Fisher Scoring iterations: 4
m43 <- glm(high_use ~ sex + famrel + studytime + goout + absences, data = alc, family = "binomial")
summary(m43)
##
## Call:
## glm(formula = high_use ~ sex + famrel + studytime + goout + absences,
## family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7673 -0.7725 -0.4525 0.7122 2.6684
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.80705 0.75693 -2.387 0.016970 *
## sexM 0.90778 0.27500 3.301 0.000963 ***
## famrel -0.40238 0.14164 -2.841 0.004499 **
## studytime -0.41980 0.17376 -2.416 0.015693 *
## goout 0.76397 0.12465 6.129 8.85e-10 ***
## absences 0.07613 0.02242 3.396 0.000683 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 359.16 on 364 degrees of freedom
## AIC: 371.16
##
## Number of Fisher Scoring iterations: 4
m44 <- glm(high_use ~ sex + famrel + goout + absences, data = alc, family = "binomial")
summary(m44)
##
## Call:
## glm(formula = high_use ~ sex + famrel + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7082 -0.7909 -0.5005 0.7441 2.5673
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.71101 0.66639 -4.068 4.74e-05 ***
## sexM 1.07940 0.26449 4.081 4.48e-05 ***
## famrel -0.41726 0.14164 -2.946 0.003219 **
## goout 0.76974 0.12445 6.185 6.20e-10 ***
## absences 0.08218 0.02227 3.691 0.000224 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 365.31 on 365 degrees of freedom
## AIC: 375.31
##
## Number of Fisher Scoring iterations: 4
m45 <- glm(high_use ~ sex + goout + absences, data = alc, family = "binomial")
summary(m45)
##
## Call:
## glm(formula = high_use ~ sex + goout + absences, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8060 -0.8090 -0.5248 0.8214 2.4806
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.18142 0.48085 -8.696 < 2e-16 ***
## sexM 1.02223 0.25946 3.940 8.15e-05 ***
## goout 0.72793 0.12057 6.038 1.56e-09 ***
## absences 0.08478 0.02266 3.741 0.000183 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 374.10 on 366 degrees of freedom
## AIC: 382.1
##
## Number of Fisher Scoring iterations: 4
m46 <- glm(high_use ~ sex + goout, data = alc, family = "binomial")
summary(m46)
##
## Call:
## glm(formula = high_use ~ sex + goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.5740 -0.8658 -0.6216 0.8272 2.4929
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.8193 0.4572 -8.354 < 2e-16 ***
## sexM 0.9268 0.2507 3.696 0.000219 ***
## goout 0.7578 0.1190 6.369 1.9e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 388.94 on 367 degrees of freedom
## AIC: 394.94
##
## Number of Fisher Scoring iterations: 4
m47 <- glm(high_use ~ goout, data = alc, family = "binomial")
summary(m47)
##
## Call:
## glm(formula = high_use ~ goout, family = "binomial", data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.3712 -0.7687 -0.5470 0.9952 2.3037
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.3368 0.4188 -7.968 1.62e-15 ***
## goout 0.7563 0.1165 6.494 8.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 403.05 on 368 degrees of freedom
## AIC: 407.05
##
## Number of Fisher Scoring iterations: 4
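The manual backward elimination above (dropping the variable with the highest p-value at each step) can also be automated with base R's step(), which eliminates terms by AIC instead of by p-value. A minimal sketch on simulated toy data (variable names x1, x2, x3 and the coefficients are made up for illustration):

```r
# step() automates backward elimination, dropping terms by AIC rather than
# by p-value as done manually above; toy data, not the alc dataset
set.seed(2)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200), x3 = rnorm(200))
d$y <- rbinom(200, 1, plogis(-0.5 + 1.5 * d$x1))  # only x1 truly matters
full <- glm(y ~ x1 + x2 + x3, data = d, family = "binomial")
reduced <- step(full, direction = "backward", trace = 0)
formula(reduced)
```

AIC-based elimination and p-value-based elimination usually, but not always, arrive at the same reduced model, so the manual route above remains a valid cross-check.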
# predict() the probabilities of high_use
prob21 <- predict(m21, type = "response")
prob22 <- predict(m22, type = "response")
prob23 <- predict(m23, type = "response")
prob24 <- predict(m24, type = "response")
prob25 <- predict(m25, type = "response")
prob26 <- predict(m26, type = "response")
prob27 <- predict(m27, type = "response")
prob28 <- predict(m28, type = "response")
prob29 <- predict(m29, type = "response")
prob30 <- predict(m30, type = "response")
prob31 <- predict(m31, type = "response")
prob32 <- predict(m32, type = "response")
prob33 <- predict(m33, type = "response")
prob34 <- predict(m34, type = "response")
prob35 <- predict(m35, type = "response")
prob36 <- predict(m36, type = "response")
prob37 <- predict(m37, type = "response")
prob38 <- predict(m38, type = "response")
prob39 <- predict(m39, type = "response")
prob40 <- predict(m40, type = "response")
prob41 <- predict(m41, type = "response")
prob42 <- predict(m42, type = "response")
prob43 <- predict(m43, type = "response")
prob44 <- predict(m44, type = "response")
prob45 <- predict(m45, type = "response")
prob46 <- predict(m46, type = "response")
prob47 <- predict(m47, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, prob21 = prob21)
alc <- mutate(alc, prob22 = prob22)
alc <- mutate(alc, prob23 = prob23)
alc <- mutate(alc, prob24 = prob24)
alc <- mutate(alc, prob25 = prob25)
alc <- mutate(alc, prob26 = prob26)
alc <- mutate(alc, prob27 = prob27)
alc <- mutate(alc, prob28 = prob28)
alc <- mutate(alc, prob29 = prob29)
alc <- mutate(alc, prob30 = prob30)
alc <- mutate(alc, prob31 = prob31)
alc <- mutate(alc, prob32 = prob32)
alc <- mutate(alc, prob33 = prob33)
alc <- mutate(alc, prob34 = prob34)
alc <- mutate(alc, prob35 = prob35)
alc <- mutate(alc, prob36 = prob36)
alc <- mutate(alc, prob37 = prob37)
alc <- mutate(alc, prob38 = prob38)
alc <- mutate(alc, prob39 = prob39)
alc <- mutate(alc, prob40 = prob40)
alc <- mutate(alc, prob41 = prob41)
alc <- mutate(alc, prob42 = prob42)
alc <- mutate(alc, prob43 = prob43)
alc <- mutate(alc, prob44 = prob44)
alc <- mutate(alc, prob45 = prob45)
alc <- mutate(alc, prob46 = prob46)
alc <- mutate(alc, prob47 = prob47)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, pred21 = prob21 > 0.5)
alc <- mutate(alc, pred22 = prob22 > 0.5)
alc <- mutate(alc, pred23 = prob23 > 0.5)
alc <- mutate(alc, pred24 = prob24 > 0.5)
alc <- mutate(alc, pred25 = prob25 > 0.5)
alc <- mutate(alc, pred26 = prob26 > 0.5)
alc <- mutate(alc, pred27 = prob27 > 0.5)
alc <- mutate(alc, pred28 = prob28 > 0.5)
alc <- mutate(alc, pred29 = prob29 > 0.5)
alc <- mutate(alc, pred30 = prob30 > 0.5)
alc <- mutate(alc, pred31 = prob31 > 0.5)
alc <- mutate(alc, pred32 = prob32 > 0.5)
alc <- mutate(alc, pred33 = prob33 > 0.5)
alc <- mutate(alc, pred34 = prob34 > 0.5)
alc <- mutate(alc, pred35 = prob35 > 0.5)
alc <- mutate(alc, pred36 = prob36 > 0.5)
alc <- mutate(alc, pred37 = prob37 > 0.5)
alc <- mutate(alc, pred38 = prob38 > 0.5)
alc <- mutate(alc, pred39 = prob39 > 0.5)
alc <- mutate(alc, pred40 = prob40 > 0.5)
alc <- mutate(alc, pred41 = prob41 > 0.5)
alc <- mutate(alc, pred42 = prob42 > 0.5)
alc <- mutate(alc, pred43 = prob43 > 0.5)
alc <- mutate(alc, pred44 = prob44 > 0.5)
alc <- mutate(alc, pred45 = prob45 > 0.5)
alc <- mutate(alc, pred46 = prob46 > 0.5)
alc <- mutate(alc, pred47 = prob47 > 0.5)
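The 27 predict/mutate/comparison blocks above all repeat the same three steps, so a loop over a list of fitted models condenses them considerably. A self-contained toy illustration with two models on simulated data (in the exercise itself the list could be built with mget(paste0("m", 21:47))):

```r
# loop over a named list of fitted models, adding a probability column and
# a TRUE/FALSE prediction column for each one; toy data for illustration
set.seed(1)
dat <- data.frame(high_use = rbinom(100, 1, 0.3), goout = rnorm(100))
m_a <- glm(high_use ~ goout, data = dat, family = "binomial")
m_b <- glm(high_use ~ 1,     data = dat, family = "binomial")
models <- list(prob_a = m_a, prob_b = m_b)
for (nm in names(models)) {
  dat[[nm]] <- predict(models[[nm]], type = "response")
  dat[[sub("prob", "pred", nm)]] <- dat[[nm]] > 0.5
}
colnames(dat)
```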
# print column names
colnames(alc)
## [1] "X" "school" "sex" "age" "address"
## [6] "famsize" "Pstatus" "Medu" "Fedu" "Mjob"
## [11] "Fjob" "reason" "guardian" "traveltime" "studytime"
## [16] "schoolsup" "famsup" "activities" "nursery" "higher"
## [21] "internet" "romantic" "famrel" "freetime" "goout"
## [26] "Dalc" "Walc" "health" "failures" "paid"
## [31] "absences" "G1" "G2" "G3" "alc_use"
## [36] "high_use" "probability" "prediction" "prob21" "prob22"
## [41] "prob23" "prob24" "prob25" "prob26" "prob27"
## [46] "prob28" "prob29" "prob30" "prob31" "prob32"
## [51] "prob33" "prob34" "prob35" "prob36" "prob37"
## [56] "prob38" "prob39" "prob40" "prob41" "prob42"
## [61] "prob43" "prob44" "prob45" "prob46" "prob47"
## [66] "pred21" "pred22" "pred23" "pred24" "pred25"
## [71] "pred26" "pred27" "pred28" "pred29" "pred30"
## [76] "pred31" "pred32" "pred33" "pred34" "pred35"
## [81] "pred36" "pred37" "pred38" "pred39" "pred40"
## [86] "pred41" "pred42" "pred43" "pred44" "pred45"
## [91] "pred46" "pred47"
# define a loss function (average prediction error)
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
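The loss function counts an observation as misclassified when the predicted probability is more than 0.5 away from the true class (0 or 1), i.e. on the wrong side of the 0.5 cutoff. A tiny worked example:

```r
# same loss function as above: proportion of predictions on the wrong
# side of the 0.5 cutoff
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}
# two of the three predictions land on the correct side of 0.5,
# so one of three is wrong
loss_func(class = c(1, 0, 1), prob = c(0.9, 0.2, 0.3))  # 1/3
```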
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$prob21) %>% round(digits = 3)
## [1] 0.189
loss_func(class = alc$high_use, prob = alc$prob22) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob23) %>% round(digits = 3)
## [1] 0.184
loss_func(class = alc$high_use, prob = alc$prob24) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob25) %>% round(digits = 3)
## [1] 0.178
loss_func(class = alc$high_use, prob = alc$prob26) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob27) %>% round(digits = 3)
## [1] 0.176
loss_func(class = alc$high_use, prob = alc$prob28) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob29) %>% round(digits = 3)
## [1] 0.178
loss_func(class = alc$high_use, prob = alc$prob30) %>% round(digits = 3)
## [1] 0.184
loss_func(class = alc$high_use, prob = alc$prob31) %>% round(digits = 3)
## [1] 0.195
loss_func(class = alc$high_use, prob = alc$prob32) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob33) %>% round(digits = 3)
## [1] 0.203
loss_func(class = alc$high_use, prob = alc$prob34) %>% round(digits = 3)
## [1] 0.186
loss_func(class = alc$high_use, prob = alc$prob35) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob36) %>% round(digits = 3)
## [1] 0.181
loss_func(class = alc$high_use, prob = alc$prob37) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob38) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob39) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob40) %>% round(digits = 3)
## [1] 0.203
loss_func(class = alc$high_use, prob = alc$prob41) %>% round(digits = 3)
## [1] 0.197
loss_func(class = alc$high_use, prob = alc$prob42) %>% round(digits = 3)
## [1] 0.211
loss_func(class = alc$high_use, prob = alc$prob43) %>% round(digits = 3)
## [1] 0.216
loss_func(class = alc$high_use, prob = alc$prob44) %>% round(digits = 3)
## [1] 0.2
loss_func(class = alc$high_use, prob = alc$prob45) %>% round(digits = 3)
## [1] 0.211
loss_func(class = alc$high_use, prob = alc$prob46) %>% round(digits = 3)
## [1] 0.214
loss_func(class = alc$high_use, prob = alc$prob47) %>% round(digits = 3)
## [1] 0.27
# K-fold cross-validation with the same loss function as above
# (K = nrow(alc) / 10 = 37 folds)
library(boot)
cv21 <- cv.glm(data = alc, cost = loss_func, glmfit = m21, K = nrow(alc) / 10)
cv22 <- cv.glm(data = alc, cost = loss_func, glmfit = m22, K = nrow(alc) / 10)
cv23 <- cv.glm(data = alc, cost = loss_func, glmfit = m23, K = nrow(alc) / 10)
cv24 <- cv.glm(data = alc, cost = loss_func, glmfit = m24, K = nrow(alc) / 10)
cv25 <- cv.glm(data = alc, cost = loss_func, glmfit = m25, K = nrow(alc) / 10)
cv26 <- cv.glm(data = alc, cost = loss_func, glmfit = m26, K = nrow(alc) / 10)
cv27 <- cv.glm(data = alc, cost = loss_func, glmfit = m27, K = nrow(alc) / 10)
cv28 <- cv.glm(data = alc, cost = loss_func, glmfit = m28, K = nrow(alc) / 10)
cv29 <- cv.glm(data = alc, cost = loss_func, glmfit = m29, K = nrow(alc) / 10)
cv30 <- cv.glm(data = alc, cost = loss_func, glmfit = m30, K = nrow(alc) / 10)
cv31 <- cv.glm(data = alc, cost = loss_func, glmfit = m31, K = nrow(alc) / 10)
cv32 <- cv.glm(data = alc, cost = loss_func, glmfit = m32, K = nrow(alc) / 10)
cv33 <- cv.glm(data = alc, cost = loss_func, glmfit = m33, K = nrow(alc) / 10)
cv34 <- cv.glm(data = alc, cost = loss_func, glmfit = m34, K = nrow(alc) / 10)
cv35 <- cv.glm(data = alc, cost = loss_func, glmfit = m35, K = nrow(alc) / 10)
cv36 <- cv.glm(data = alc, cost = loss_func, glmfit = m36, K = nrow(alc) / 10)
cv37 <- cv.glm(data = alc, cost = loss_func, glmfit = m37, K = nrow(alc) / 10)
cv38 <- cv.glm(data = alc, cost = loss_func, glmfit = m38, K = nrow(alc) / 10)
cv39 <- cv.glm(data = alc, cost = loss_func, glmfit = m39, K = nrow(alc) / 10)
cv40 <- cv.glm(data = alc, cost = loss_func, glmfit = m40, K = nrow(alc) / 10)
cv41 <- cv.glm(data = alc, cost = loss_func, glmfit = m41, K = nrow(alc) / 10)
cv42 <- cv.glm(data = alc, cost = loss_func, glmfit = m42, K = nrow(alc) / 10)
cv43 <- cv.glm(data = alc, cost = loss_func, glmfit = m43, K = nrow(alc) / 10)
cv44 <- cv.glm(data = alc, cost = loss_func, glmfit = m44, K = nrow(alc) / 10)
cv45 <- cv.glm(data = alc, cost = loss_func, glmfit = m45, K = nrow(alc) / 10)
cv46 <- cv.glm(data = alc, cost = loss_func, glmfit = m46, K = nrow(alc) / 10)
cv47 <- cv.glm(data = alc, cost = loss_func, glmfit = m47, K = nrow(alc) / 10)
# average number of wrong predictions in the cross validation
cv21$delta[1] %>% round(digits = 3)
## [1] 0.257
cv22$delta[1] %>% round(digits = 3)
## [1] 0.241
cv23$delta[1] %>% round(digits = 3)
## [1] 0.243
cv24$delta[1] %>% round(digits = 3)
## [1] 0.238
cv25$delta[1] %>% round(digits = 3)
## [1] 0.241
cv26$delta[1] %>% round(digits = 3)
## [1] 0.232
cv27$delta[1] %>% round(digits = 3)
## [1] 0.23
cv28$delta[1] %>% round(digits = 3)
## [1] 0.238
cv29$delta[1] %>% round(digits = 3)
## [1] 0.23
cv30$delta[1] %>% round(digits = 3)
## [1] 0.23
cv31$delta[1] %>% round(digits = 3)
## [1] 0.227
cv32$delta[1] %>% round(digits = 3)
## [1] 0.23
cv33$delta[1] %>% round(digits = 3)
## [1] 0.227
cv34$delta[1] %>% round(digits = 3)
## [1] 0.214
cv35$delta[1] %>% round(digits = 3)
## [1] 0.227
cv36$delta[1] %>% round(digits = 3)
## [1] 0.23
cv37$delta[1] %>% round(digits = 3)
## [1] 0.222
cv38$delta[1] %>% round(digits = 3)
## [1] 0.219
cv39$delta[1] %>% round(digits = 3)
## [1] 0.222
cv40$delta[1] %>% round(digits = 3)
## [1] 0.222
cv41$delta[1] %>% round(digits = 3)
## [1] 0.216
cv42$delta[1] %>% round(digits = 3)
## [1] 0.224
cv43$delta[1] %>% round(digits = 3)
## [1] 0.227
cv44$delta[1] %>% round(digits = 3)
## [1] 0.208
cv45$delta[1] %>% round(digits = 3)
## [1] 0.219
cv46$delta[1] %>% round(digits = 3)
## [1] 0.232
cv47$delta[1] %>% round(digits = 3)
## [1] 0.27
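Because cv.glm() partitions the data into random folds, its delta values change from run to run; calling set.seed() before the cross-validation makes the result reproducible. A self-contained sketch on the built-in mtcars data (the seed value 123 and K = 8 are arbitrary choices for illustration):

```r
# cv.glm draws the folds at random, so identical seeds give identical folds
# and therefore identical cross-validation errors
library(boot)
loss_func <- function(class, prob) mean(abs(class - prob) > 0.5)
fit <- glm(am ~ mpg, data = mtcars, family = "binomial")
set.seed(123)
cv_a <- cv.glm(data = mtcars, cost = loss_func, glmfit = fit, K = 8)
set.seed(123)
cv_b <- cv.glm(data = mtcars, cost = loss_func, glmfit = fit, K = 8)
cv_a$delta[1] == cv_b$delta[1]  # TRUE: same folds, same error
```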
# create a new dataset of training and testing errors
# (npredictor is the backward-elimination step: step 1 appears to be the full
#  model m21 with 27 predictors, so the step-k model has 28 - k predictors;
#  the test errors come from a separate cross-validation run, so they differ
#  slightly from the deltas printed above)
err <- data.frame(npredictor = c(1:27),
train = c(0.189, 0.186, 0.184, 0.181, 0.178, 0.186, 0.176, 0.181, 0.178, 0.184, 0.195, 0.200, 0.203, 0.186, 0.200, 0.181, 0.200, 0.200, 0.200, 0.203, 0.197, 0.211, 0.216, 0.200, 0.211, 0.214, 0.270),
test = c(0.259, 0.235, 0.249, 0.235, 0.241, 0.251, 0.232, 0.23, 0.227, 0.227, 0.232, 0.230, 0.227, 0.216, 0.219, 0.219, 0.227, 0.222, 0.222, 0.224, 0.211, 0.224, 0.230, 0.216, 0.219, 0.243, 0.270))
# reshape the dataset to long format (pivot_longer() is from the tidyr package)
library(tidyr)
err_long <- err %>% pivot_longer(cols = c("train", "test"), names_to = "model", values_to = "ploss")
# print dimensions and column names
dim(err); colnames(err)
## [1] 27 3
## [1] "npredictor" "train" "test"
dim(err_long); colnames(err_long)
## [1] 54 3
## [1] "npredictor" "model" "ploss"
# access the ggplot2 library
library(ggplot2)
# Initialize plots with data and aesthetic mapping
ep1 <- ggplot(err_long, aes(x = npredictor, y = ploss, color = model))
# Define the visualization type (points) + add a regression line
# + add a main title and draw the plot
ep2 <- ep1 + geom_point() + geom_smooth(method = "glm", formula = y ~ poly(x, 2)) + ggtitle("Predictive loss in training and testing")
ep2
# new data for prediction
errpred <- data.frame(npredictor = c(1:27))
# fit second-degree polynomial models to the training and testing errors
# (note: inside a formula, npredictor^2 on its own is read as just npredictor,
#  because ^ is a formula operator; poly() or I(npredictor^2) is needed)
predm1 <- glm(train ~ poly(npredictor, 2), data = err)
predm2 <- glm(test ~ poly(npredictor, 2), data = err)
# predict the fitted values (predict.glm ignores the 'interval' argument,
# so only point predictions are returned, without confidence intervals)
predict(predm1, newdata = errpred, interval = 'confidence') %>% round(digits = 3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.187 0.186 0.185 0.184 0.183 0.183 0.183 0.183 0.184 0.184 0.185 0.186 0.188
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.189 0.191 0.193 0.196 0.198 0.201 0.204 0.207 0.211 0.215 0.219 0.223 0.227
## 27
## 0.232
predict(predm2, newdata = errpred, interval = 'confidence') %>% round(digits = 3)
## 1 2 3 4 5 6 7 8 9 10 11 12 13
## 0.256 0.251 0.247 0.243 0.239 0.236 0.233 0.230 0.228 0.226 0.224 0.223 0.222
## 14 15 16 17 18 19 20 21 22 23 24 25 26
## 0.221 0.221 0.221 0.221 0.222 0.223 0.224 0.225 0.227 0.230 0.232 0.235 0.238
## 27
## 0.242
predict(predm1, newdata = errpred, interval = 'confidence') %>% round(digits = 3) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1830 0.1845 0.1890 0.1966 0.2055 0.2320
predict(predm2, newdata = errpred, interval = 'confidence') %>% round(digits = 3) %>% summary()
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2210 0.2230 0.2280 0.2311 0.2370 0.2560
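The step with the lowest modeled error can also be read off directly with which.min() on the fitted second-degree curves, instead of scanning the prediction vectors by eye. A self-contained sketch that rebuilds the err data from the values above:

```r
# locate the minimum of the fitted second-degree polynomials with which.min()
# (same error values as in the err data frame above)
err <- data.frame(
  npredictor = 1:27,
  train = c(0.189, 0.186, 0.184, 0.181, 0.178, 0.186, 0.176, 0.181, 0.178,
            0.184, 0.195, 0.200, 0.203, 0.186, 0.200, 0.181, 0.200, 0.200,
            0.200, 0.203, 0.197, 0.211, 0.216, 0.200, 0.211, 0.214, 0.270),
  test  = c(0.259, 0.235, 0.249, 0.235, 0.241, 0.251, 0.232, 0.230, 0.227,
            0.227, 0.232, 0.230, 0.227, 0.216, 0.219, 0.219, 0.227, 0.222,
            0.222, 0.224, 0.211, 0.224, 0.230, 0.216, 0.219, 0.243, 0.270))
fit_train <- glm(train ~ poly(npredictor, 2), data = err)
fit_test  <- glm(test  ~ poly(npredictor, 2), data = err)
which.min(fitted(fit_train))  # step with the lowest modeled training error
which.min(fitted(fit_test))   # step with the lowest modeled testing error
```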
# summary values for dataset err
summary(err)
## npredictor train test
## Min. : 1.0 Min. :0.1760 Min. :0.211
## 1st Qu.: 7.5 1st Qu.:0.1840 1st Qu.:0.222
## Median :14.0 Median :0.1970 Median :0.227
## Mean :14.0 Mean :0.1967 Mean :0.231
## 3rd Qu.:20.5 3rd Qu.:0.2015 3rd Qu.:0.235
## Max. :27.0 Max. :0.2700 Max. :0.270
To compare different logistic regression models, I started from a model with 27 explanatory variables, i.e. nearly all the variables available. I then fitted 27 models in total, reducing the model one variable at a time by deleting the variable with the highest p-value. In the model with seven explanatory variables (sex, address, paid, family relations, study time, going out and absences), all predictors are statistically significant (p < 0.05).
The training errors ranged from 0.176 to 0.270 and the testing errors from 0.211 to 0.270. The plot shows that the relationship between the elimination step and the error is not linear but closer to a second-degree polynomial. In training, the smallest error (0.176) occurred at step 7, i.e. in a model with 21 explanatory variables; in testing, the smallest error (0.211) occurred at step 21, i.e. in the seven-variable model described above. When these errors are in turn modeled with second-degree polynomials, the modeled training error is smallest (0.183) at steps 5-8 (models with 20-23 predictors) and the modeled testing error is smallest (0.221) at steps 14-17 (models with 11-14 predictors). All in all, the errors varied little, implying that the phenomenon (high use of alcohol) is indeed multifaceted by nature. In addition, the categorization of some variables does not seem statistically optimal, and some ordinal variables might be better modeled as categorical ones.
knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Dec 7 15:32:47 2022"
# LOADING THE DATA
# access the MASS package
library(MASS)
# load the data
data("Boston")
# define subset of data for boxplots
box <- dplyr::select(Boston, -tax, -black)
# explore the dataset
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
# plot distributions and matrix of the variables
boxplot(box)
boxplot(Boston$tax)
boxplot(Boston$black)
pairs(Boston)
The Boston data contain information from 506 towns on factors such as crime rate, residential land, nitrogen oxides concentration, rooms per dwelling, tax rate, pupil-teacher ratio, population status and value of owner-occupied homes. Altogether 14 variables are included. No values are missing.
Variable distributions are mostly skewed. The variable age (the proportion of owner-occupied units built before 1940) has a median of 77.5% (range 2.9% to 100%). Crime rate varies from 0.006 to 89.0, tax rate from 187 to 711 per $10,000, pupil-teacher ratio from 12.6 to 22.0, the proportion of lower-status population from 1.7% to 38.0% and nitrogen oxides concentration from 0.39 to 0.87 parts per 10 million between towns.
# EXPLORING THE DATA
# access the tidyr and corrplot libraries
library(tidyr)
library(corrplot)
# calculate the correlation matrix and round it
cor_matrix <- cor(Boston) %>% round(digits = 2)
# print the correlation matrix
cor_matrix
## crim zn indus chas nox rm age dis rad tax ptratio
## crim 1.00 -0.20 0.41 -0.06 0.42 -0.22 0.35 -0.38 0.63 0.58 0.29
## zn -0.20 1.00 -0.53 -0.04 -0.52 0.31 -0.57 0.66 -0.31 -0.31 -0.39
## indus 0.41 -0.53 1.00 0.06 0.76 -0.39 0.64 -0.71 0.60 0.72 0.38
## chas -0.06 -0.04 0.06 1.00 0.09 0.09 0.09 -0.10 -0.01 -0.04 -0.12
## nox 0.42 -0.52 0.76 0.09 1.00 -0.30 0.73 -0.77 0.61 0.67 0.19
## rm -0.22 0.31 -0.39 0.09 -0.30 1.00 -0.24 0.21 -0.21 -0.29 -0.36
## age 0.35 -0.57 0.64 0.09 0.73 -0.24 1.00 -0.75 0.46 0.51 0.26
## dis -0.38 0.66 -0.71 -0.10 -0.77 0.21 -0.75 1.00 -0.49 -0.53 -0.23
## rad 0.63 -0.31 0.60 -0.01 0.61 -0.21 0.46 -0.49 1.00 0.91 0.46
## tax 0.58 -0.31 0.72 -0.04 0.67 -0.29 0.51 -0.53 0.91 1.00 0.46
## ptratio 0.29 -0.39 0.38 -0.12 0.19 -0.36 0.26 -0.23 0.46 0.46 1.00
## black -0.39 0.18 -0.36 0.05 -0.38 0.13 -0.27 0.29 -0.44 -0.44 -0.18
## lstat 0.46 -0.41 0.60 -0.05 0.59 -0.61 0.60 -0.50 0.49 0.54 0.37
## medv -0.39 0.36 -0.48 0.18 -0.43 0.70 -0.38 0.25 -0.38 -0.47 -0.51
## black lstat medv
## crim -0.39 0.46 -0.39
## zn 0.18 -0.41 0.36
## indus -0.36 0.60 -0.48
## chas 0.05 -0.05 0.18
## nox -0.38 0.59 -0.43
## rm 0.13 -0.61 0.70
## age -0.27 0.60 -0.38
## dis 0.29 -0.50 0.25
## rad -0.44 0.49 -0.38
## tax -0.44 0.54 -0.47
## ptratio -0.18 0.37 -0.51
## black 1.00 -0.37 0.33
## lstat -0.37 1.00 -0.74
## medv 0.33 -0.74 1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", tl.pos = "d", tl.cex = 0.6)
The variables are mostly correlated with each other. The exception is the Charles River dummy variable, which describes whether the tract bounds the river and correlates only mildly with the median value of owner-occupied homes (rho = 0.18). There are remarkably high correlations between age and both nitrogen oxides concentration (rho = 0.73) and mean distance to main employment centres (rho = -0.75); between non-retail business acres and nitrogen oxides concentration (rho = 0.76), mean distance to main employment centres (rho = -0.71) and tax rate (rho = 0.72); between average number of rooms per dwelling and median value of owner-occupied homes (rho = 0.70); between accessibility to radial highways and tax rate (rho = 0.91); and between the proportion of lower-status population and median value of owner-occupied homes (rho = -0.74).
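The strongly correlated pairs listed above can also be pulled out programmatically. A sketch on the same Boston correlation matrix:

```r
# Extract the variable pairs with |rho| > 0.7 from the correlation matrix
library(MASS)
cm <- cor(Boston)
cm[lower.tri(cm, diag = TRUE)] <- NA          # keep each pair only once
idx <- which(abs(cm) > 0.7, arr.ind = TRUE)   # NA entries are skipped
data.frame(var1 = rownames(cm)[idx[, 1]],
           var2 = colnames(cm)[idx[, 2]],
           rho  = round(cm[idx], 2))
```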
# SCALING THE DATASET AND CREATING 'CRIME' VARIABLE
# accessing library dplyr
library(dplyr)
# center and standardize variables
boston_scaled <- scale(Boston)
# summaries of the scaled variables
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
# plot distributions
boxplot(boston_scaled)
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix" "array"
# change the object to data frame
boston_scaled <- scale(Boston) %>% as.data.frame
# summary of the scaled crime rate
summary(boston_scaled$crim)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -0.419367 -0.410563 -0.390280 0.000000 0.007389 9.924110
# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE)
# look at the table of the new factor crime
table(crime)
## crime
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 127 126 126 127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
Scaling centers the variables and puts them on a common scale, which makes the boxplots comparable; it does not change the shape (skewness) of the distributions, since standardization is a linear transformation.
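As a side note, standardization is a linear transformation, so a moment-based skewness statistic is unchanged by scale(). A quick check (skew is a helper defined here, not a base R function):

```r
library(MASS)
# simple moment skewness; a hypothetical helper, not a base R function
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3
c(raw = skew(Boston$crim), scaled = skew(as.vector(scale(Boston$crim))))
# the two values are identical: scaling changes location and spread only
```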
# LINEAR DISCRIMINANT ANALYSIS
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
# save the correct classes from test data
correct_classes <- test$crime
# remove the crime variable from test data
test <- dplyr::select(test, -crime)
# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)
# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## 0.2549505 0.2549505 0.2425743 0.2475248
##
## Group means:
## zn indus chas nox rm
## [-0.419,-0.411] 1.07944043 -0.9667172 -0.15765625 -0.8905182 0.5081647
## (-0.411,-0.39] -0.09739127 -0.3042219 -0.04298342 -0.5331778 -0.1250711
## (-0.39,0.00739] -0.39011035 0.1345288 0.24993933 0.3339979 0.2251615
## (0.00739,9.92] -0.48724019 1.0171519 -0.07547406 1.0522729 -0.4879676
## age dis rad tax ptratio
## [-0.419,-0.411] -0.8999018 0.9132906 -0.6964270 -0.7207324 -0.50056176
## (-0.411,-0.39] -0.2880024 0.2939571 -0.5592799 -0.4648469 -0.06870264
## (-0.39,0.00739] 0.3877924 -0.3703937 -0.3654491 -0.3022553 -0.29666718
## (0.00739,9.92] 0.8290601 -0.8492568 1.6377820 1.5138081 0.78037363
## black lstat medv
## [-0.419,-0.411] 0.38188802 -0.80588970 0.58705789
## (-0.411,-0.39] 0.32319524 -0.10126790 0.01480093
## (-0.39,0.00739] 0.07740057 -0.05685743 0.24995110
## (0.00739,9.92] -0.76203007 0.89974533 -0.65072726
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.0756633057 0.79846732 -0.778188769
## indus 0.0229246969 -0.32396654 0.212664714
## chas -0.0685920761 -0.12113528 0.012402285
## nox 0.4561968704 -0.57309813 -1.411060603
## rm -0.0954404100 -0.12296589 -0.233891079
## age 0.3283385181 -0.27446688 -0.248983339
## dis 0.0003193076 -0.25220887 0.002261836
## rad 3.0451446599 0.78518676 -0.110064367
## tax -0.0693849676 0.10257578 0.588177545
## ptratio 0.0896426859 0.09131930 -0.180942183
## black -0.1219054909 -0.01916257 0.106545965
## lstat 0.1822179145 -0.19665151 0.447023960
## medv 0.1858741268 -0.30998949 -0.145951415
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9447 0.0409 0.0144
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# target classes as numeric
classes <- as.numeric(train$crime)
# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 2)
In linear discriminant analysis, the highest crime class almost solely occupies one cluster, while the other crime classes are mixed in the other cluster. The most influential separators seem to be accessibility to radial highways, proportion of residential land zoned for lots over 2,323 m2 (25,000 sq ft) and nitrogen oxides concentration, most likely reflecting differences between rural and urban environments. The first discriminant function captures 94.5% of the between-group variance (proportion of trace 0.9447).
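The proportion-of-trace figures come from the singular values stored in the fit and can be recomputed by hand. A sketch on iris (a reproducible stand-in, since the train/test split above is random):

```r
library(MASS)
fit <- lda(Species ~ ., data = iris)
# proportion of between-group variance captured by each discriminant;
# matches the "Proportion of trace" line that print(fit) shows
round(fit$svd^2 / sum(fit$svd^2), 4)
```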
# PREDICTING WITH THE LDA MODEL
# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class) %>% addmargins
## predicted
## correct [-0.419,-0.411] (-0.411,-0.39] (-0.39,0.00739] (0.00739,9.92]
## [-0.419,-0.411] 10 12 2 0
## (-0.411,-0.39] 3 16 4 0
## (-0.39,0.00739] 1 9 18 0
## (0.00739,9.92] 0 0 0 27
## Sum 14 37 24 27
## predicted
## correct Sum
## [-0.419,-0.411] 24
## (-0.411,-0.39] 23
## (-0.39,0.00739] 28
## (0.00739,9.92] 27
## Sum 102
The LDA model works reasonably well: it correctly predicts the crime class for 71 of the 102 test towns (70%).
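The prediction accuracy can be computed straight from the confusion matrix rather than counted by hand. A sketch, re-entering the cross-tabulation above (rows = correct, columns = predicted):

```r
# Accuracy = correct predictions / all predictions
tab <- rbind(c(10, 12,  2,  0),
             c( 3, 16,  4,  0),
             c( 1,  9, 18,  0),
             c( 0,  0,  0, 27))
sum(diag(tab)) / sum(tab)   # proportion of correctly classified test towns
```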
# DISTANCES AND CLUSTERING
# access library ggplot2
library(ggplot2)
# scale dataset
boston_scaled2 <- data.frame(scale(Boston))
# euclidean distance matrix
dist_eu <- dist(boston_scaled2)
# look at the summary of the distances
summary(dist_eu)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1343 3.4625 4.8241 4.9111 6.1863 14.3970
# manhattan distance matrix
dist_man <- dist(boston_scaled2, method = "manhattan")
# look at the summary of the distances
summary(dist_man)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2662 8.4832 12.6090 13.5488 17.7568 48.8618
# k-means clustering
km <- kmeans(boston_scaled2, centers = 3)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
pairs(boston_scaled2[1:7], col = km$cluster)
pairs(boston_scaled2[8:14], col = km$cluster)
set.seed(123)
# determine the number of clusters
k_max <- 10
# calculate the total within sum of squares
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})
# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')
# k-means clustering
km <- kmeans(boston_scaled2, centers = 2)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)
pairs(boston_scaled2[1:7], col = km$cluster)
pairs(boston_scaled2[8:14], col = km$cluster)
Scaling shrinks the pairwise distances and makes their distribution less skewed. For k-means clustering, two centers seem to be the best choice based on the drop in total within-cluster sum of squares. One cluster seems to be associated with a low crime rate, a low proportion of non-retail business acres, low nitrogen oxides concentration, newer housing stock (low age), long mean distance to main employment centres, low accessibility to radial highways, a low tax rate, a high value of the black variable and a high median value of owner-occupied homes. This may indicate that the clustering separates rural or otherwise less densely populated areas from urban areas.
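The within-cluster sum of squares elbow can be complemented with the average silhouette width (higher is better). A sketch using the cluster package, which ships with R:

```r
library(MASS); library(cluster)
set.seed(123)
bs <- scale(Boston)
d <- dist(bs)
# average silhouette width for a few candidate numbers of clusters
sil <- sapply(2:5, function(k) {
  cl <- kmeans(bs, centers = k, nstart = 10)$cluster
  mean(silhouette(cl, d)[, "sil_width"])
})
round(setNames(sil, paste0("k=", 2:5)), 3)
```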
# BONUS SECTION: MORE ON K-MEANS CLUSTERING AND LINEAR DISCRIMINANT ANALYSIS
# k-means clustering
km2 <- kmeans(boston_scaled2, centers = 6)
# plot the Boston dataset with clusters
pairs(boston_scaled2, col = km2$cluster)
pairs(boston_scaled2[1:7], col = km2$cluster)
pairs(boston_scaled2[8:14], col = km2$cluster)
# linear discriminant analysis
lda.fit2 <- lda(km2$cluster ~ ., data = boston_scaled2)
# print the lda.fit object
lda.fit2
## Call:
## lda(km2$cluster ~ ., data = boston_scaled2)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6
## 0.06719368 0.12252964 0.24505929 0.18972332 0.29051383 0.08498024
##
## Group means:
## crim zn indus chas nox rm age
## 1 -0.1985497 -0.2602436 0.2799956 3.6647712 0.3830784 0.2756681 0.3721322
## 2 -0.4141953 2.3322773 -1.1788641 -0.2088275 -1.1887944 0.7188485 -1.4216721
## 3 1.1156495 -0.4872402 1.0149946 -0.2723291 0.9916473 -0.4276828 0.7515952
## 4 -0.3269211 -0.4816572 0.6406213 -0.2723291 0.4664828 -0.5082323 0.7668344
## 5 -0.3977957 -0.1563279 -0.5990216 -0.2723291 -0.6696819 -0.1686625 -0.6751063
## 6 -0.3732407 -0.1422288 -0.8310017 -0.2723291 -0.2005297 1.6901170 0.1841367
## dis rad tax ptratio black lstat
## 1 -0.4033382 0.001081444 -0.0975633 -0.39245849 0.17154271 -0.1643525
## 2 1.6223203 -0.666968948 -0.5535972 -0.82951564 0.35193833 -0.9665363
## 3 -0.8170870 1.659602895 1.5294129 0.80577843 -0.81154619 0.9129958
## 4 -0.5732656 -0.602637816 -0.1649468 0.21059409 0.04824716 0.5397714
## 5 0.5672190 -0.575610755 -0.6877857 -0.05330275 0.36267528 -0.3956902
## 6 -0.3232389 -0.511800977 -0.8155263 -1.10522083 0.34963036 -0.9616241
## medv
## 1 0.573340910
## 2 0.799806470
## 3 -0.771340259
## 4 -0.505310108
## 5 0.007601824
## 6 1.719927960
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3 LD4 LD5
## crim 0.034508803 0.05797085 -0.16103219 0.152322830 0.01702637
## zn 0.263383770 -0.23309900 -1.54525759 -1.099430119 0.78886901
## indus -0.073596625 0.35511069 0.27621202 -0.741302164 -0.03049072
## chas -5.833329077 -0.28810784 -0.25366217 -0.153849858 -0.20101832
## nox 0.012019761 -0.33200265 0.08267993 -0.138256930 0.38821434
## rm 0.077523647 0.02092389 -0.05279584 0.346603115 0.54639625
## age -0.078503080 0.10642631 0.50741330 -0.163867301 0.79431423
## dis -0.055475885 -0.36793105 -0.33307268 -0.006313437 -0.37843684
## rad -0.319867249 3.41804068 -1.06679562 1.629076215 -0.94341422
## tax 0.011360375 -0.30248228 -0.48448257 -0.924010759 0.59988910
## ptratio -0.010141142 -0.03367258 0.16740977 -0.504498820 0.12595753
## black -0.007393166 -0.08886704 0.01924355 0.045161817 -0.07202500
## lstat 0.123902259 0.13351602 0.01047765 0.007966329 0.40940453
## medv 0.203273034 -0.38745314 -0.07192725 0.516887596 0.76079378
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5
## 0.6292 0.2589 0.0730 0.0230 0.0159
# the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
# plot the lda results
plot(lda.fit2, dimen = 2, col = km2$cluster, pch = km2$cluster)
lda.arrows(lda.fit2, myscale = 2)
Linear discriminant analysis produces three visible clusters: one occupied by k-means cluster 1, another by k-means cluster 3, and a third containing the remaining k-means clusters. The most influential separators seem to be accessibility to radial highways and the Charles River dummy variable. The first discriminant function captures 62.9% and the second 25.9% of the between-group variance.
# SUPER-BONUS SECTION: 3D PLOTTING
# access library plotly
library(plotly)
# define a new object
model_predictors <- dplyr::select(train, -crime)
# check the dimensions
dim(model_predictors)
## [1] 404 13
dim(lda.fit$scaling)
## [1] 13 3
# matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
# k-means clustering
km3 <- kmeans(matrix_product, centers = 2)
km4 <- km3$cluster
# Create 3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers')
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = km4)
The first matrix plot shows two clusters that are well separated from each other. In the second plot, one cluster is mainly occupied by the highest crime class. In the third plot, the clusters coincide fully with the k-means clustering with two centers.
knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Dec 7 15:33:36 2022"
# BRING IN THE DATA
# access packages readr and dplyr
library(readr); library(dplyr)
# Set the working directory
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")
human <- read.csv("human.csv")
# add countries as rownames
rownames(human) <- human$X
# remove the Country variable
human <- select(human, -X)
# EXPLORING THE DATA
# Access GGally and corrplot
library(GGally)
library(corrplot)
# visualize the variables
p <- ggpairs(human, progress = FALSE) # for some reason ggpairs produces error in knitting...
p
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% round(digits = 3)
## edu2ratio jobratio explife expyrsed GNI MMR ABR propparl
## edu2ratio 1.000 0.010 0.576 0.593 0.430 -0.661 -0.529 0.079
## jobratio 0.010 1.000 -0.140 0.047 -0.022 0.240 0.120 0.250
## explife 0.576 -0.140 1.000 0.789 0.627 -0.857 -0.729 0.170
## expyrsed 0.593 0.047 0.789 1.000 0.624 -0.736 -0.704 0.206
## GNI 0.430 -0.022 0.627 0.624 1.000 -0.495 -0.557 0.089
## MMR -0.661 0.240 -0.857 -0.736 -0.495 1.000 0.759 -0.089
## ABR -0.529 0.120 -0.729 -0.704 -0.557 0.759 1.000 -0.071
## propparl 0.079 0.250 0.170 0.206 0.089 -0.089 -0.071 1.000
Distributions of the variables are rather skewed and the scales vary considerably from one variable to another. The variables are fairly highly correlated with each other, except for the labor participation ratio between genders (jobratio) and parliamentary representation (propparl), which correlate least with the other factors.
# PRINCIPAL COMPONENT ANALYSIS
# perform principal component analysis (with the SVD method)
pca <- prcomp(human)
# summary of pca
summary(pca)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## Standard deviation 1.854e+04 185.5219 25.19 11.45 3.766 1.566 0.1912 0.1591
## Proportion of Variance 9.999e-01 0.0001 0.00 0.00 0.000 0.000 0.0000 0.0000
## Cumulative Proportion 9.999e-01 1.0000 1.00 1.00 1.000 1.000 1.0000 1.0000
# draw a biplot of the principal component representation and the original variables
biplot(pca, choices = 1:2, col = c("grey40", "deeppink2"), cex = c(0.8, 1))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length =
## arrow.len): zero-length arrow is of indeterminate angle and so skipped
# standardize the variables
human_std <- scale(human)
# print out summaries of the standardized variables
summary(human_std)
## edu2ratio jobratio explife expyrsed
## Min. :-2.8189 Min. :-2.6247 Min. :-2.7188 Min. :-2.7378
## 1st Qu.:-0.5233 1st Qu.:-0.5484 1st Qu.:-0.6425 1st Qu.:-0.6782
## Median : 0.3503 Median : 0.2316 Median : 0.3056 Median : 0.1140
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5958 3rd Qu.: 0.7350 3rd Qu.: 0.6717 3rd Qu.: 0.7126
## Max. : 2.6646 Max. : 1.6632 Max. : 1.4218 Max. : 2.4730
## GNI MMR ABR propparl
## Min. :-0.9193 Min. :-0.6992 Min. :-1.1325 Min. :-1.8203
## 1st Qu.:-0.7243 1st Qu.:-0.6496 1st Qu.:-0.8394 1st Qu.:-0.7409
## Median :-0.3013 Median :-0.4726 Median :-0.3298 Median :-0.1403
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.3712 3rd Qu.: 0.1932 3rd Qu.: 0.6030 3rd Qu.: 0.6127
## Max. : 5.6890 Max. : 4.4899 Max. : 3.8344 Max. : 3.1850
# compute the correlation matrix and visualize it with corrplot
cor(human_std) %>% round(digits = 3)
## edu2ratio jobratio explife expyrsed GNI MMR ABR propparl
## edu2ratio 1.000 0.010 0.576 0.593 0.430 -0.661 -0.529 0.079
## jobratio 0.010 1.000 -0.140 0.047 -0.022 0.240 0.120 0.250
## explife 0.576 -0.140 1.000 0.789 0.627 -0.857 -0.729 0.170
## expyrsed 0.593 0.047 0.789 1.000 0.624 -0.736 -0.704 0.206
## GNI 0.430 -0.022 0.627 0.624 1.000 -0.495 -0.557 0.089
## MMR -0.661 0.240 -0.857 -0.736 -0.495 1.000 0.759 -0.089
## ABR -0.529 0.120 -0.729 -0.704 -0.557 0.759 1.000 -0.071
## propparl 0.079 0.250 0.170 0.206 0.089 -0.089 -0.071 1.000
# perform principal component analysis (with the SVD method)
pca_std <- prcomp(human_std)
# summary of pca
summary(pca_std)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.0708 1.1397 0.87505 0.77886 0.66196 0.53631 0.45900
## Proportion of Variance 0.5361 0.1624 0.09571 0.07583 0.05477 0.03595 0.02634
## Cumulative Proportion 0.5361 0.6984 0.79413 0.86996 0.92473 0.96069 0.98702
## PC8
## Standard deviation 0.32224
## Proportion of Variance 0.01298
## Cumulative Proportion 1.00000
# draw a biplot of the principal component representation and the original variables
biplot(pca_std, choices = 1:2, col = c("grey40", "deeppink2"), cex = c(0.8, 1))
As expected, standardization puts all variables on the same scale (variance 1). Without scaling, PCA produces components the first of which explains nearly all of the variation in the dataset and is dominated by GNI, the variable with by far the largest variance. After scaling, the first four components cumulatively explain 87% of the variation in the data. The arrows in the biplot nicely reflect the observed correlations between variables.
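Standardizing before PCA is common enough that prcomp has a shortcut. A sketch on the numeric columns of iris (stand-in data) showing that prcomp(x, scale. = TRUE) matches scaling by hand:

```r
x <- iris[, 1:4]
p1 <- prcomp(scale(x))            # standardize first, then PCA
p2 <- prcomp(x, scale. = TRUE)    # let prcomp standardize internally
all.equal(p1$sdev, p2$sdev)       # TRUE: identical component standard deviations
```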
# create a summary of pca_std
s <- summary(pca_std)
# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2, ], digits = 1)
# print out the pca_std object (standard deviations and rotation)
pca_std
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
##
## Rotation (n x k) = (8 x 8):
## PC1 PC2 PC3 PC4 PC5
## edu2ratio -0.35664370 0.03796058 -0.24223089 0.62678110 -0.5983585
## jobratio 0.05457785 0.72432726 -0.58428770 0.06199424 0.2625067
## explife -0.44372240 -0.02530473 0.10991305 -0.05834819 0.1628935
## expyrsed -0.42766720 0.13940571 -0.07340270 -0.07020294 0.1659678
## GNI -0.35048295 0.05060876 -0.20168779 -0.72727675 -0.4950306
## MMR 0.43697098 0.14508727 -0.12522539 -0.25170614 -0.1800657
## ABR 0.41126010 0.07708468 0.01968243 0.04986763 -0.4672068
## propparl -0.08438558 0.65136866 0.72506309 0.01396293 -0.1523699
## PC6 PC7 PC8
## edu2ratio 0.17713316 0.05773644 0.16459453
## jobratio -0.03500707 -0.22729927 -0.07304568
## explife -0.42242796 -0.43406432 0.62737008
## expyrsed -0.38606919 0.77962966 -0.05415984
## GNI 0.11120305 -0.13711838 -0.16961173
## MMR 0.17370039 0.35380306 0.72193946
## ABR -0.76056557 -0.06897064 -0.14335186
## propparl 0.13749772 0.00568387 -0.02306476
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 53.6 16.2 9.6 7.6 5.5 3.6 2.6 1.3
# preview the axis labels
paste0(names(pca_pr), " (", pca_pr, "%)")
## [1] "PC1 (53.6%)" "PC2 (16.2%)" "PC3 (9.6%)" "PC4 (7.6%)" "PC5 (5.5%)"
## [6] "PC6 (3.6%)" "PC7 (2.6%)" "PC8 (1.3%)"
# create objects pc_lab and pc_lab2 to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
lbl <- c("Underdevelopment level:", "Gender equality:", "", "", "", "", "", "")
pc_lab2 <- paste0(lbl, " ", pc_lab)
# draw a biplot again
biplot(pca_std, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab2[1], ylab = pc_lab2[2])
PC1 seems to be associated with the development level of the country: the main variables loading on it are maternal mortality rate, adolescent birth rate, life expectancy and expected years of education. PC2, on the other hand, seems to reflect gender equality, being mainly driven by the gender ratio in labor-force participation and parliamentary representation. Interestingly, the gender ratio in secondary education is more related to the development level of a country than to the gender equality reflected in the labor market and governance.
# MULTIPLE CORRESPONDENCE ANALYSIS
# accessing libraries dplyr, tidyr and ggplot2
library(dplyr)
library(tidyr)
library(ggplot2)
# load tea data
tea_time <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/Helsinki-Open-Data-Science/master/datasets/tea_time.csv",
sep = ",", header = T)
# create a numerical version of tea data
tea_timen <- tea_time
tea_timen$Tea[tea_timen$Tea == "black"] <- 1
tea_timen$Tea[tea_timen$Tea == "Earl Grey"] <- 2
tea_timen$Tea[tea_timen$Tea == "green"] <- 3
tea_timen$How[tea_timen$How == "alone"] <- 1
tea_timen$How[tea_timen$How == "lemon"] <- 2
tea_timen$How[tea_timen$How == "milk"] <- 3
tea_timen$How[tea_timen$How == "other"] <- 4
tea_timen$how[tea_timen$how == "tea bag"] <- 1
tea_timen$how[tea_timen$how == "tea bag+unpackaged"] <- 2
tea_timen$how[tea_timen$how == "unpackaged"] <- 3
tea_timen$sugar[tea_timen$sugar == "sugar"] <- 1
tea_timen$sugar[tea_timen$sugar == "No.sugar"] <- 2
tea_timen$where[tea_timen$where == "chain store"] <- 1
tea_timen$where[tea_timen$where == "chain store+tea shop"] <- 2
tea_timen$where[tea_timen$where == "tea shop"] <- 3
tea_timen$lunch[tea_timen$lunch == "lunch"] <- 1
tea_timen$lunch[tea_timen$lunch == "Not.lunch"] <- 2
tea_timen <- as.data.frame(apply(tea_timen, 2, as.numeric))
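The block above recodes each column line by line; the same mapping can be written once per variable as a named lookup vector (a sketch, assuming the labels match the data exactly):

```r
# Named lookup vector: name = original label, value = numeric code
tea_map <- c(black = 1, "Earl Grey" = 2, green = 3)
recode_col <- function(x, map) unname(map[x])
recode_col(c("black", "green", "Earl Grey"), tea_map)   # 1 3 2
```

Unlike as.integer(factor(x)), which codes levels alphabetically, this keeps the coding explicit and easy to audit.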
# look at the summaries and structure of the data
str(tea_time)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : chr "black" "black" "Earl Grey" "Earl Grey" ...
## $ How : chr "alone" "milk" "alone" "alone" ...
## $ how : chr "tea bag" "tea bag" "tea bag" "tea bag" ...
## $ sugar: chr "sugar" "No.sugar" "No.sugar" "sugar" ...
## $ where: chr "chain store" "chain store" "chain store" "chain store" ...
## $ lunch: chr "Not.lunch" "Not.lunch" "Not.lunch" "Not.lunch" ...
summary(tea_time)
## Tea How how sugar
## Length:300 Length:300 Length:300 Length:300
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
## where lunch
## Length:300 Length:300
## Class :character Class :character
## Mode :character Mode :character
str(tea_timen)
## 'data.frame': 300 obs. of 6 variables:
## $ Tea : num 1 1 2 2 2 2 2 1 2 1 ...
## $ How : num 1 3 1 1 1 1 1 3 3 1 ...
## $ how : num 1 1 1 1 1 1 1 1 2 2 ...
## $ sugar: num 1 2 2 1 2 2 2 2 2 2 ...
## $ where: num 1 1 1 1 1 1 1 1 2 2 ...
## $ lunch: num 2 2 2 2 2 2 2 2 2 2 ...
summary(tea_timen)
## Tea How how sugar where
## Min. :1.000 Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.00
## 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:1.000 1st Qu.:1.000 1st Qu.:1.00
## Median :2.000 Median :1.00 Median :1.000 Median :2.000 Median :1.00
## Mean :1.863 Mean :1.62 Mean :1.553 Mean :1.517 Mean :1.46
## 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:2.000 3rd Qu.:2.000 3rd Qu.:2.00
## Max. :3.000 Max. :4.00 Max. :3.000 Max. :2.000 Max. :3.00
## lunch
## Min. :1.000
## 1st Qu.:2.000
## Median :2.000
## Mean :1.853
## 3rd Qu.:2.000
## Max. :2.000
# visualize the dataset
pivot_longer(tea_time, cols = everything()) %>%
ggplot(aes(value)) + facet_wrap("name", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
p <- ggpairs(tea_timen, progress = FALSE)
p
# multiple correspondence analysis
library(FactoMineR)
mca <- MCA(tea_time[, 1:5], graph = FALSE)
# summary of the models
summary(mca)
##
## Call:
## MCA(X = tea_time[, 1:5], graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6 Dim.7
## Variance 0.335 0.309 0.257 0.224 0.198 0.184 0.172
## % of var. 16.762 15.441 12.866 11.181 9.924 9.188 8.625
## Cumulative % of var. 16.762 32.204 45.069 56.250 66.174 75.362 83.987
## Dim.8 Dim.9 Dim.10
## Variance 0.141 0.105 0.075
## % of var. 7.031 5.249 3.734
## Cumulative % of var. 91.017 96.266 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | -0.325 0.105 0.088 | -0.288 0.090 0.069 | 0.305
## 2 | -0.259 0.067 0.037 | -0.074 0.006 0.003 | 0.789
## 3 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 4 | -0.580 0.334 0.481 | -0.328 0.116 0.154 | -0.290
## 5 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 6 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 7 | -0.403 0.162 0.242 | -0.283 0.086 0.119 | 0.217
## 8 | -0.259 0.067 0.037 | -0.074 0.006 0.003 | 0.789
## 9 | 0.155 0.024 0.012 | 0.974 1.025 0.461 | -0.005
## 10 | 0.521 0.270 0.142 | 0.845 0.770 0.373 | 0.612
## ctr cos2
## 1 0.120 0.078 |
## 2 0.806 0.343 |
## 3 0.061 0.070 |
## 4 0.109 0.120 |
## 5 0.061 0.070 |
## 6 0.061 0.070 |
## 7 0.061 0.070 |
## 8 0.806 0.343 |
## 9 0.000 0.000 |
## 10 0.485 0.196 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr cos2
## black | 0.473 3.294 0.073 4.681 | 0.198 0.627 0.013
## Earl Grey | -0.265 2.689 0.126 -6.147 | 0.088 0.319 0.014
## green | 0.487 1.558 0.029 2.962 | -0.956 6.515 0.113
## alone | -0.017 0.011 0.001 -0.405 | -0.251 2.643 0.117
## lemon | 0.668 2.926 0.055 4.059 | 0.458 1.497 0.026
## milk | -0.338 1.428 0.030 -3.010 | 0.220 0.659 0.013
## other | 0.287 0.148 0.003 0.873 | 2.207 9.465 0.151
## tea bag | -0.607 12.466 0.482 -12.008 | -0.336 4.134 0.147
## tea bag+unpackaged | 0.348 2.261 0.055 4.063 | 1.000 20.281 0.456
## unpackaged | 1.959 27.484 0.524 12.511 | -1.025 8.173 0.143
## v.test Dim.3 ctr cos2 v.test
## black 1.961 | 1.045 20.945 0.358 10.342 |
## Earl Grey 2.033 | -0.462 10.689 0.386 -10.737 |
## green -5.813 | 0.360 1.110 0.016 2.190 |
## alone -5.905 | 0.150 1.136 0.042 3.534 |
## lemon 2.787 | -1.561 20.842 0.301 -9.491 |
## milk 1.963 | 0.093 0.141 0.002 0.828 |
## other 6.712 | 1.826 7.773 0.103 5.552 |
## tea bag -6.637 | 0.069 0.211 0.006 1.367 |
## tea bag+unpackaged 11.677 | -0.050 0.061 0.001 -0.586 |
## unpackaged -6.548 | -0.196 0.357 0.005 -1.249 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## Tea | 0.126 0.115 0.421 |
## How | 0.076 0.220 0.385 |
## how | 0.708 0.503 0.008 |
## sugar | 0.065 0.004 0.412 |
## where | 0.701 0.702 0.061 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali", graph.type = "classic")
The tea_time variables are rather weakly associated with each other. The
only exception is the association between how the tea is purchased (as
tea bags or unpackaged) and where it is bought (from a chain store or a
tea shop). In the MCA factor map, the categories stay mostly close to
the origin, implying that these variables do not discriminate strongly
between respondents. Still, the map hints that some drink green tea
bought unpackaged from a tea shop, some drink Earl Grey (bought as tea
bags from chain stores) with sugar and milk, and some drink black tea
with lemon and no sugar. The categories ‘other’, ‘chain store + tea
shop’ and ‘tea bag + unpackaged’ stand out from the rest but are rather
hard to interpret due to their mixed nature.
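As a minimal sketch of what MCA does, the same steps can be run on a small invented dataset (the variable names and categories below are made up; this assumes the FactoMineR package is available). The point is that the dimensions partition the total variance, so the percentages of variance always sum to 100, as in the eigenvalue table above.

```r
# a minimal MCA sketch on invented categorical data (assumes FactoMineR)
library(FactoMineR)
set.seed(1)
toy <- data.frame(
  kind  = sample(c("black", "green", "Earl Grey"), 50, replace = TRUE),
  where = sample(c("chain store", "tea shop"), 50, replace = TRUE),
  how   = sample(c("tea bag", "unpackaged"), 50, replace = TRUE)
)
m <- MCA(toy, graph = FALSE)
# percentages of variance retained by each dimension sum to 100
m$eig[, "percentage of variance"]
```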
knitr::opts_chunk$set(message = FALSE)
date()
## [1] "Wed Dec 7 15:33:56 2022"
# BRING IN THE DATA
# access packages readr and dplyr
library(readr); library(dplyr)
# set the working directory
setwd("C:/Users/yona2/Desktop/FT/IODS2022/RStudio/IODS-project/data")
# load the datasets
RATSL <- read.csv("RATSL.csv")
BPRSL <- read.csv("BPRSL.csv")
# check structure
str(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : int 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
str(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: int 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : int 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
# factor variables ID and Group in RATSL
RATSL$ID <- factor(RATSL$ID)
RATSL$Group <- factor(RATSL$Group)
# factor variables treatment and subject in BPRSL
BPRSL$treatment <- factor(BPRSL$treatment)
BPRSL$subject <- factor(BPRSL$subject)
# check structure again
str(RATSL); summary(RATSL)
## 'data.frame': 176 obs. of 5 variables:
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr "WD1" "WD1" "WD1" "WD1" ...
## $ Weight: int 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int 1 1 1 1 1 1 1 1 1 1 ...
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
str(BPRSL); summary(BPRSL)
## 'data.frame': 360 obs. of 5 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
## treatment subject weeks bprs week
## 1:180 1 : 18 Length:360 Min. :18.00 Min. :0
## 2:180 2 : 18 Class :character 1st Qu.:27.00 1st Qu.:2
## 3 : 18 Mode :character Median :35.00 Median :4
## 4 : 18 Mean :37.66 Mean :4
## 5 : 18 3rd Qu.:43.00 3rd Qu.:6
## 6 : 18 Max. :95.00 Max. :8
## (Other):252
The RATS data contain information on 16 rats that were assigned to 3 treatment groups. Weight was measured 11 times, at baseline on day 1 (WD1) and about once a week thereafter up to day 64 (WD64). During the study, the weight varied from 225 to 628 g.
The BPRS data contain information on 40 subjects that were assigned to 2 treatment groups, 20 subjects in each. Response to treatment was measured 9 times with the Brief Psychiatric Rating Scale (BPRS), at baseline (week0) and weekly thereafter up to 8 weeks (week1-week8). During the study, BPRS varied from 18 to 95.
# ANALYSING RATS DATA
# access the package ggplot2
library(ggplot2)
# draw the plot
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, times=2)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
# standardise the variable Weight
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdWeight = (Weight - mean(Weight))/sd(Weight)) %>%
ungroup()
# check structure and summary anew
str(RATSL); summary(RATSL)
## tibble [176 × 6] (S3: tbl_df/tbl/data.frame)
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Group : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ WD : chr [1:176] "WD1" "WD1" "WD1" "WD1" ...
## $ Weight : int [1:176] 240 225 245 260 255 260 275 245 410 405 ...
## $ Time : int [1:176] 1 1 1 1 1 1 1 1 1 1 ...
## $ stdWeight: num [1:176] -1.001 -1.12 -0.961 -0.842 -0.882 ...
## ID Group WD Weight Time
## 1 : 11 1:88 Length:176 Min. :225.0 Min. : 1.00
## 2 : 11 2:44 Class :character 1st Qu.:267.0 1st Qu.:15.00
## 3 : 11 3:44 Mode :character Median :344.5 Median :36.00
## 4 : 11 Mean :384.5 Mean :33.55
## 5 : 11 3rd Qu.:511.2 3rd Qu.:50.00
## 6 : 11 Max. :628.0 Max. :64.00
## (Other):110
## stdWeight
## Min. :-1.1618
## 1st Qu.:-0.9084
## Median :-0.2062
## Mean : 0.0000
## 3rd Qu.: 0.9235
## Max. : 1.6330
##
# plot again with the standardized Weight
library(ggplot2)
ggplot(RATSL, aes(x = Time, y = stdWeight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:8, times=2)) +
facet_grid(. ~ Group, labeller = label_both) +
scale_y_continuous(name = "standardized Weight")
As expected, standardization removes the overall growth trend and puts each timepoint on a common scale. Tracking can be detected in the plots, and the phenomenon is somewhat clearer after standardization. It is also notable that the variance of weight seems to decrease somewhat over the follow-up in all treatment groups, even though this is less evident in group 1. There seems to be weight gain over time in all groups, especially in groups 2 and 3. In group 2, one rat seems to be in a league of its own.
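The effect of the per-timepoint standardization used above can be checked on a small invented dataset: after the transformation, every timepoint has mean 0 and standard deviation 1 by construction (this sketch assumes dplyr; the toy values are made up).

```r
# timepoint-wise standardization on made-up data (assumes dplyr)
library(dplyr)
set.seed(1)
toy <- data.frame(Time = rep(c(1, 8), each = 4),
                  Weight = c(rnorm(4, 250, 10), rnorm(4, 300, 15)))
toy <- toy %>%
  group_by(Time) %>%
  mutate(stdWeight = (Weight - mean(Weight)) / sd(Weight)) %>%
  ungroup()
# within each timepoint: mean 0, sd 1
toy %>% group_by(Time) %>% summarise(m = mean(stdWeight), s = sd(stdWeight))
```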
# number of subjects (per group):
n <- 8 # note: group 1 has 8 rats; groups 2 and 3 have 4 rats each
library(tidyr)
# summary data with mean and standard error of weight by group and time
RATSS <- RATSL %>%
group_by(Group, Time) %>%
summarise(mean = mean(Weight), se = sd(Weight)/sqrt(n)) %>%
ungroup()
# correction to se in groups 2 and 3
group <- RATSS$Group %>% as.numeric
RATSS$se <- ifelse(group > 1, sqrt(2)*RATSS$se, RATSS$se)
# plot the mean profiles
ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=3) +
scale_shape_manual(values = c(1,2,3)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
In the plot, the mean (±SE) weight is shown at each timepoint by treatment group. The variances are larger in groups 2 and 3 than in group 1. Variance may decrease a little in group 3 over time, but in the other groups there is not much change. There seems to be weight gain in all groups, but the slopes look steeper in groups 2 and 3.
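The sqrt(2) correction applied to the standard errors above follows from se = sd/sqrt(n): an se mistakenly computed with n = 8 can be converted into the correct one for n = 4, since sd/sqrt(4) = sqrt(2) * sd/sqrt(8). A base-R sketch with four hypothetical weights:

```r
# why multiplying by sqrt(2) corrects an se computed with the wrong n
w <- c(240, 255, 262, 271)      # four hypothetical rat weights
se4 <- sd(w) / sqrt(4)          # correct se for n = 4
se8 <- sd(w) / sqrt(8)          # se mistakenly computed with n = 8
all.equal(se4, sqrt(2) * se8)   # TRUE
```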
# create a summary dataset by group and ID with the mean as the summary variable (ignoring the baseline measurement on day 1)
RATSL10S <- RATSL %>%
filter(Time > 1) %>%
group_by(Group, ID) %>%
summarise(mean=mean(Weight)) %>%
ungroup()
# check structure and summary of the new object
str(RATSL10S); summary(RATSL10S)
## tibble [16 × 3] (S3: tbl_df/tbl/data.frame)
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean : num [1:16] 263 239 262 267 271 ...
## Group ID mean
## 1:8 1 : 1 Min. :238.9
## 2:4 2 : 1 1st Qu.:267.4
## 3:4 3 : 1 Median :360.1
## 4 : 1 Mean :386.3
## 5 : 1 3rd Qu.:505.4
## 6 : 1 Max. :594.0
## (Other):10
# draw a boxplot of the mean versus treatment
ggplot(RATSL10S, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 8-64")
# create a new dataset by filtering out the outlier, and adjust the ggplot code to draw the plot again with the new data
RATSL10S1 <- RATSL10S %>% filter(mean < 580)
# check structure and summary of the object
str(RATSL10S1); summary(RATSL10S1)
## tibble [15 × 3] (S3: tbl_df/tbl/data.frame)
## $ Group: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 2 2 ...
## $ ID : Factor w/ 16 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ mean : num [1:15] 263 239 262 267 271 ...
## Group ID mean
## 1:8 1 :1 Min. :238.9
## 2:3 2 :1 1st Qu.:267.4
## 3:4 3 :1 Median :276.2
## 4 :1 Mean :372.5
## 5 :1 3rd Qu.:476.4
## 6 :1 Max. :542.2
## (Other):9
# draw a boxplot of the mean versus treatment again
ggplot(RATSL10S1, aes(x = Group, y = mean)) +
geom_boxplot() +
stat_summary(fun = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight), days 8-64")
Comparing the overall means shows an outlier in group 2 (mean weight > 590 g). Filtering it out changes the boxplot of group 2. There is a marked difference between the groups, but it is mostly explained by the difference already present at baseline.
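The outlier above was removed with a hand-picked cut-off (mean < 580). A more systematic, though still heuristic, alternative is the 1.5 × IQR rule that boxplots themselves use; a base-R sketch on invented summary means:

```r
# flag boxplot outliers (points beyond 1.5 * IQR from the hinges)
m <- c(263, 239, 262, 267, 271, 250, 280, 594)  # invented mean weights
boxplot.stats(m)$out
# only the extreme value 594 is flagged
```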
# t-test can be used only for 2 groups, so fit a basic linear model instead
fit1 <- lm(mean ~ Group, data = RATSL10S1)
summary(fit1)
##
## Call:
## lm(formula = mean ~ Group, data = RATSL10S1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -32.300 -2.575 3.400 8.800 14.700
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 265.025 5.086 52.11 1.64e-15 ***
## Group2 187.375 9.738 19.24 2.19e-10 ***
## Group3 262.475 8.809 29.80 1.28e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14.38 on 12 degrees of freedom
## Multiple R-squared: 0.9882, Adjusted R-squared: 0.9862
## F-statistic: 501.8 on 2 and 12 DF, p-value: 2.721e-12
# compute the analysis of variance table for the fitted model with anova()
anova(fit1)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 207659 103830 501.81 2.721e-12 ***
## Residuals 12 2483 207
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# read the RATS data in order to get baseline values
RATS <- read.table("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt", header = TRUE, sep = '\t')
# add the baseline from the original data as a new variable to the summary data
RATSL10S2 <- RATSL10S %>%
mutate(baseline = RATS$WD1) %>% # note: the number of observations needs to be the same, hence filtering anew
filter(mean < 590)
# fit the linear model with the mean as the response
fit2 <- lm(mean ~ baseline + Group, data = RATSL10S2)
summary(fit2)
##
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSL10S2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.2818 -5.4247 -0.0667 5.7389 13.7887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 119.8515 31.5069 3.804 0.002923 **
## baseline 0.5792 0.1251 4.630 0.000728 ***
## Group2 89.2652 22.0021 4.057 0.001892 **
## Group3 112.9571 32.7344 3.451 0.005421 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.749 on 11 degrees of freedom
## Multiple R-squared: 0.996, Adjusted R-squared: 0.9949
## F-statistic: 911.4 on 3 and 11 DF, p-value: 1.842e-13
# compute the analysis of variance table for the fitted model with anova()
anova(fit2)
## Analysis of Variance Table
##
## Response: mean
## Df Sum Sq Mean Sq F value Pr(>F)
## baseline 1 207817 207817 2714.9357 1.601e-14 ***
## Group 2 1483 742 9.6878 0.003748 **
## Residuals 11 842 77
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, in the base model the mean overall weight differs significantly between the groups, and the results are in concordance with the previous plot. The model explains 99% of the variation in mean overall weight. The mean overall weight is 187 g higher in group 2 and 262 g higher in group 3 compared with group 1. However, this result says nothing about differences in weight gain between the groups during the experiment.
Baseline weight is a statistically significant addition to the model: it explains vastly more of the variation than treatment. Yet there still seems to be a statistically significant group effect: the mean overall weight remains 89 g higher in group 2 and 113 g higher in group 3 compared with group 1.
Using the overall mean can be justified if we are interested in the treatment effect over the whole period rather than just the end situation. If only the final status matters, especially if that status can be expected to be more or less permanent, it might be more accurate to compare the weights at the end of the study.
A note of caution: the summary mean was computed after filtering out the baseline measurement (Time > 1), so it covers the 10 post-baseline weighings from day 8 onwards.
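The value of the baseline adjustment in fit2 can be illustrated on simulated data: if two groups differ at baseline but gain the same amount, an unadjusted comparison of final weights shows a large "group effect", while the baseline-adjusted model does not (the numbers and variable names below are invented).

```r
# ANCOVA sketch: baseline adjustment removes a spurious group effect
set.seed(1)
n <- 10
base  <- c(rnorm(n, 250, 10), rnorm(n, 450, 10))  # groups differ at baseline
grp   <- factor(rep(1:2, each = n))
final <- base + rnorm(2 * n, 20, 5)               # both groups gain ~20 g
# adjusted for baseline, the grp2 coefficient should be near zero
coef(summary(lm(final ~ base + grp)))
```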
# ANALYSING BPRS DATA
# Plot the BPRS data (note: each subject needs to be identified by a unique subject number)
treatment <- BPRSL$treatment %>% as.numeric
BPRSL$subject1 <- ifelse(treatment == 2, 20+as.numeric(BPRSL$subject), as.numeric(BPRSL$subject))
ggplot(BPRSL, aes(x = week, y = bprs, group = subject1, color = subject1)) +
geom_line(aes(linetype = treatment)) + scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "BPRS")
# Check the structure of the data
str(BPRSL)
## 'data.frame': 360 obs. of 6 variables:
## $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ subject : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ weeks : chr "week0" "week0" "week0" "week0" ...
## $ bprs : int 42 58 54 55 72 48 71 30 41 57 ...
## $ week : int 0 0 0 0 0 0 0 0 0 0 ...
## $ subject1 : num 1 2 3 4 5 6 7 8 9 10 ...
In the plot, there does not seem to be much difference between the groups. Tracking seems prevalent, and the variation in BPRS seems to decrease over time.
# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data = BPRSL)
# print out a summary of the model
summary(BPRS_reg)
##
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRSL)
##
## Residuals:
## Min 1Q Median 3Q Max
## -22.454 -8.965 -3.196 7.002 50.244
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.4539 1.3670 33.982 <2e-16 ***
## week -2.2704 0.2524 -8.995 <2e-16 ***
## treatment2 0.5722 1.3034 0.439 0.661
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared: 0.1851, Adjusted R-squared: 0.1806
## F-statistic: 40.55 on 2 and 357 DF, p-value: < 2.2e-16
# access library lme4
library(lme4)
# Create a random intercept model
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject1), data = BPRSL, REML = FALSE)
# print the model summary
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject1)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2582.9 2602.3 -1286.5 2572.9 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.27506 -0.59909 -0.06104 0.44226 3.15835
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject1 (Intercept) 97.39 9.869
## Residual 54.23 7.364
## Number of obs: 360, groups: subject1, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.3521 19.750
## week -2.2704 0.1503 -15.104
## treatment2 0.5722 3.2159 0.178
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.256
## treatment2 -0.684 0.000
In the fixed effects model, the observations are treated as if they were independent of each other. Time (week) is a significant explanatory variable but treatment group is not. The model explains only about 19% of the variation in BPRS points. The fixed effects estimate for week (-2.3) implies an 18-point decrease over 8 weeks, which can be considered clinically significant. So time seems to heal, but not the treatment.
In the random intercept model, the variation between subjects is taken into account. The fixed effects estimates remain the same, but part of the variance in BPRS points is now attributed to random differences between subjects.
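The variance components reported for the random intercept model imply an intraclass correlation, i.e. the share of total variance attributable to between-subject differences:

```r
# intraclass correlation from the variance components of BPRS_ref
var_subject  <- 97.39   # subject1 (Intercept) variance from summary(BPRS_ref)
var_residual <- 54.23   # residual variance
icc <- var_subject / (var_subject + var_residual)
round(icc, 2)
# about 0.64: nearly two thirds of the variation is between subjects
```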
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject1), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject1)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.2 2550.4 -1254.6 2509.2 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4655 -0.5150 -0.0920 0.4347 3.7353
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject1 (Intercept) 167.827 12.955
## week 2.331 1.527 -0.67
## Residual 36.747 6.062
## Number of obs: 360, groups: subject1, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 45.9830 2.6470 17.372
## week -2.2704 0.2713 -8.370
## treatment2 1.5139 3.1392 0.482
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.545
## treatment2 -0.593 0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject1)
## BPRS_ref1: bprs ~ week + treatment + (week | subject1)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref 5 2582.9 2602.3 -1286.5 2572.9
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2 63.663 2 1.499e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After accounting for the subject-specific time trends by adding week to the random effects, there is still no treatment effect. On the other hand, there clearly is time dependence: comparing the models with and without the random slope for week, AIC decreases and the log-likelihood increases, and the difference is statistically significant. There are only minor changes in the fixed effects estimates.
These models show that a large part of the variation in BPRS values can be explained by interindividual and intraindividual random variation. The impression that in this case time heals, not the treatment, becomes stronger.
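The anova() comparison above is a likelihood ratio test, and it can be reproduced by hand from the reported deviances (the 2 degrees of freedom come from the two extra random-effect parameters: the slope variance and the intercept-slope correlation):

```r
# likelihood ratio test from the reported deviances
dev_ref  <- 2572.9   # random intercept model
dev_ref1 <- 2509.2   # random intercept + random slope model
chisq <- dev_ref - dev_ref1                 # 63.7, matching the Chisq column
pchisq(chisq, df = 2, lower.tail = FALSE)   # ~1.5e-14, up to rounding
```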
# create a random intercept and random slope model with the interaction
BPRS_ref2 <- lmer(bprs ~ week + treatment + week * treatment + (week | subject1), data = BPRSL, REML = FALSE)
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + week * treatment + (week | subject1)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2523.5 2554.5 -1253.7 2507.5 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.4747 -0.5256 -0.0866 0.4435 3.7884
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject1 (Intercept) 164.204 12.814
## week 2.203 1.484 -0.66
## Residual 36.748 6.062
## Number of obs: 360, groups: subject1, 40
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.9840 16.047
## week -2.6283 0.3752 -7.006
## treatment2 -2.2911 4.2200 -0.543
## week:treatment2 0.7158 0.5306 1.349
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.668
## treatment2 -0.707 0.473
## wek:trtmnt2 0.473 -0.707 -0.668
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject1)
## BPRS_ref2: bprs ~ week + treatment + week * treatment + (week | subject1)
## npar AIC BIC logLik deviance Chisq Df Pr(>Chisq)
## BPRS_ref1 7 2523.2 2550.4 -1254.6 2509.2
## BPRS_ref2 8 2523.5 2554.6 -1253.7 2507.5 1.78 1 0.1821
# draw the plot of BPRSL with the observed bprs values
ggplot(BPRSL, aes(x = week, y = bprs, group = subject1)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Observed BPRS points") +
theme(legend.position = "top")
# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)
# Add the fitted values as a new column to BPRSL
BPRSL <- BPRSL %>% mutate(Fitted = Fitted)
# draw the plot of BPRSL with the fitted bprs values
ggplot(BPRSL, aes(x = week, y = Fitted, group = subject1)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
scale_y_continuous(name = "Fitted BPRS points") +
theme(legend.position = "top")
Adding an interaction term for week and treatment to the random effects model does not improve it: AIC and the log-likelihood barely change. The model fits the observations reasonably well if one is willing to believe that the relationship between BPRS and time is linear; a second-degree polynomial model might, however, fit better.
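The second-degree polynomial idea could be tried with an I(week^2) term. A hedged sketch on simulated data (this assumes lme4; the data, effect sizes and object names below are invented, and treatment is omitted for brevity):

```r
# sketch of a quadratic-in-time mixed model on simulated data (assumes lme4)
library(lme4)
set.seed(1)
d <- expand.grid(subject = factor(1:20), week = 0:8)
d$bprs <- 50 - 4 * d$week + 0.3 * d$week^2 +      # curved true trend
  rnorm(20, 0, 5)[d$subject] +                    # subject-level random shift
  rnorm(nrow(d), 0, 3)                            # measurement noise
fit_poly <- lmer(bprs ~ week + I(week^2) + (1 | subject), data = d, REML = FALSE)
fixef(fit_poly)
# a clearly nonzero I(week^2) coefficient would favour the curved model
```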
date()
## [1] "Wed Dec 7 15:34:06 2022"
This IODS2022 course has been a very helpful crash course for me into the wide world of R, and it has served its purpose well. I had long been planning to start using R, but there was a definite threshold. It is easy to guess that I had not used R Markdown before either, but it seems to be a rather nice tool, and I will probably experiment with writing manuscripts in it later on. I am deeply grateful to all those who have given, and keep giving, their skill, knowledge, energy and time to developing open source tools and programs like these.
As a researcher, I am familiar with statistics and statistical tools such as SPSS and SAS, but so far I have not really worked with repeated measures data. In that sense, it has been nice to get a better feel for methods such as principal component analysis and the modelling of repeated measures data.
I am not sure whether I will ever again take advantage of Git or GitHub, but it is indeed a handy tool for version control and sharing data. In general, I am very hesitant to put anything on an open cloud platform, but I can see its merits when properly used. One caveat (learned by trial and error) is that Git and GitHub do not like large files, or at least do not like to display them.
Even though this course is indeed very laborious, the course material has been very well constructed. Taking those steps one at a time, while solving the problems one is always bound to encounter with statistical programs, has kept my frustration at a reasonable level. I feel I now have a nice grasp of R, and all that remains is to keep learning.